<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta>
      <issn pub-type="ppub">1613-0073</issn>
    </journal-meta>
    <article-meta>
      <title-group>
        <article-title>Synthetically generated proofs in Propositional Logic and Language Model reasoning performance</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Anthony Marchiafava</string-name>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Atriya Sen</string-name>
          <email>asen@atriyasen.com</email>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Justin Moua</string-name>
        </contrib>
      </contrib-group>
      <kwd-group>
        <kwd>Large Language Models</kwd>
        <kwd>Synthetic Training Data</kwd>
        <kwd>Reasoning</kwd>
        <kwd>Machine Learning</kwd>
      </kwd-group>
      <pub-date>
        <year>2025</year>
      </pub-date>
      <abstract>
        <p>Large language models (LLMs) are a powerful tool in machine learning specifically and artificial intelligence broadly. However, LLMs are prone to a variety of hallucinations and mistakes in reasoning. This paper shows that improving the ability to generate propositional logic proofs specifically does not lead to greater performance on reasoning tasks in general. This is demonstrated by evaluating models fine-tuned on a synthetically generated proof-by-contradiction dataset against a variety of reasoning-based benchmarks.</p>
      </abstract>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>1. Introduction</title>
      <p>
        The capacity to use and apply logic has long been an important task in artificial intelligence. Recent
work in machine learning broadly, and with large language models specifically, is concerned with
improving reasoning performance. Large Language Models (LLMs) are a powerful tool in artificial
intelligence, but the text they generate is prone to a variety of mistakes and hallucinations in general,
and to reasoning errors in particular [
        <xref ref-type="bibr" rid="ref1">1</xref>
        ] [
        <xref ref-type="bibr" rid="ref2">2</xref>
        ]. Increasing the trustworthiness of
these models is a meaningful goal. To evaluate this, a number of problems can be framed as reasoning
problems in a variety of logics. One of the simplest of these is propositional
logic, where knowledge is encoded as sets of statements combined with logical connectives [
        <xref ref-type="bibr" rid="ref3">3</xref>
        ].
One can evaluate more sophisticated logics such as first-order logic [
        <xref ref-type="bibr" rid="ref4">4</xref>
        ] or train an LLM for mathematical
tasks including higher-order logics [
        <xref ref-type="bibr" rid="ref5">5</xref>
        ]. One issue with these LLM approaches is that they can create
false or hallucinated results.
      </p>
      <p>Automated theorem provers are powerful tools that algorithmically produce step-by-step proofs
which can be validated. The proofs these automated reasoners generate serve as a form of correct
computational reasoning, showing that some set of axioms deductively entails some set of conclusions.
Automated theorem provers can thus be used to algorithmically create large amounts of synthetic data
that an LLM can learn from, tailored to focus on the types of logic that may be of interest.
In this paper the logic used is propositional, and an automated theorem prover is used to
produce valid proofs that an LLM can learn from.</p>
    </sec>
    <sec id="sec-2">
      <title>2. Related Works</title>
      <p>
        Large language models have been used for generating proofs before, either in the production of
proof steps [
        <xref ref-type="bibr" rid="ref6">6</xref>
        ] or to generate whole proofs in higher-order logic [
        <xref ref-type="bibr" rid="ref7">7</xref>
        ]. Proof assistants such
as Isabelle [
        <xref ref-type="bibr" rid="ref8">8</xref>
        ] for higher-order logic have been used to improve LLM performance on proof tasks, as in
[
        <xref ref-type="bibr" rid="ref7">7</xref>
        ]. Further work has built on using automated theorem provers to assist in proof generation through
the use of tools which can perform parts of the proof outside of the LLM [
        <xref ref-type="bibr" rid="ref9">9</xref>
        ]. Work has also been done in
identifying reasoning errors and consistent mistakes in LLMs, such as issues with in-context learning
[
        <xref ref-type="bibr" rid="ref2">2</xref>
        ].
      </p>
    </sec>
    <sec id="sec-3">
      <title>3. Methodology</title>
      <p>
        Training a large language model from start to finish is a computationally expensive and lengthy process
[
        <xref ref-type="bibr" rid="ref10">10</xref>
        ]. Instead, it is more efficient to take an already existing foundation model and adjust this model to
perform well on a specialized task through fine-tuning. The approach taken here is to test whether training a
specialized model on propositional logic proof generation extends to better performance on a wider
variety of reasoning tasks. To achieve this, a set of propositional logic statements was generated and
input into an automated theorem prover, which generated proofs; these proofs were then converted
into natural language, and the natural language statements were used to fine-tune a pretrained large
language model. The fine-tuned large language model was then evaluated on a number of benchmark
reasoning tasks, and these benchmark results were compared to the performance of the pre-fine-tuned
equivalent model.
      </p>
      <sec id="sec-3-1">
        <title>3.1. Propositional Dataset Generation</title>
        <p>
          Synthetically generating a large number of propositional logic statements was done algorithmically.
The dataset was generated using a normal grammar in conjunctive normal form (CNF).
Propositional logic statements can be expressed as combinations of conjunctions, where the conjunctions
are combinations of conjunctions ’and’ed with disjunctions, and disjunctions are the combination
of disjunctions ’or’ed with literals [
          <xref ref-type="bibr" rid="ref11">11</xref>
          ]. Here the dataset was generated with four literals and their
negations: a, b, c, d, ¬a, ¬b, ¬c, ¬d. These literals were then combined to form disjunctions such
as a ∨ b, and these disjunctions were combined to form conjunctions such as (a ∨ b) ∧ c. This set of
CNF statements was then treated first as a list of axioms and then as a separate list of conjectures. Each
axiom in the list of axioms was then paired with each conjecture (where the axiom and the conjecture
were not the same statement) in the Thousands of Problems for Theorem Provers (TPTP) syntax [
          <xref ref-type="bibr" rid="ref12">12</xref>
          ]. Then
the list of combined axioms and conjectures was shuffled. This follows a similar pattern to previous work
where a synthetically generated dataset was used to train a classifier to differentiate between logic
statements whose proofs were found and those for which no proof was detected [
          <xref ref-type="bibr" rid="ref13">13</xref>
          ].
        </p>
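        <p>The generation procedure described above can be sketched in Python. This is a minimal illustration under stated assumptions, not the authors' code: the atom names, clause sizes, and TPTP formula names below are placeholders.</p>

```python
import itertools
import random

# Sketch of the dataset generator: random CNF formulas over four atoms,
# paired into TPTP axiom/conjecture problems. Atom names and size limits
# are assumptions for illustration.
ATOMS = ["a", "b", "c", "d"]
LITERALS = ATOMS + ["~" + x for x in ATOMS]  # atoms and their negations

def random_clause(rng, max_lits=3):
    """A disjunction of 1..max_lits literals, e.g. (a | ~c)."""
    lits = rng.sample(LITERALS, rng.randint(1, max_lits))
    return "(" + " | ".join(lits) + ")"

def random_cnf(rng, max_clauses=3):
    """A conjunction of clauses in CNF, e.g. (a | b) & (~c)."""
    clauses = [random_clause(rng) for _ in range(rng.randint(1, max_clauses))]
    return " & ".join(clauses)

def tptp_problem(axiom, conjecture):
    """Render one axiom/conjecture pair in TPTP first-order form syntax."""
    return (f"fof(ax, axiom, ({axiom})).\n"
            f"fof(conj, conjecture, ({conjecture})).")

rng = random.Random(0)
statements = [random_cnf(rng) for _ in range(10)]
# Pair every statement as an axiom with every *other* statement as a
# conjecture, then shuffle the resulting problems, as in the paper.
problems = [tptp_problem(ax, cj)
            for ax, cj in itertools.permutations(statements, 2)]
rng.shuffle(problems)
print(problems[0])
```

Each resulting problem is a self-contained text file that a TPTP-compliant prover such as E can consume directly.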
        <p>
          Each individual combination of axiom and conjecture was then input into the automated theorem
prover E, which produced a proof showing whether the axioms entailed the conjecture [
          <xref ref-type="bibr" rid="ref14">14</xref>
          ]. This process of
feeding input into the automated theorem prover continued until a set number of suitable proofs, here 10,000,
was produced. Each of these saved proofs showed that the conjecture was entailed by the axioms and
was used for fine-tuning, while the results which showed no entailment were not used.
        </p>
        <p>The E theorem prover utilized proof by contradiction to produce the proofs which were used to
create the synthetic dataset here. This process begins by asserting the negation of the conclusion and
attempting to derive a contradiction. If the negation of the conclusion entails a contradiction, then the
original conclusion is entailed. The E theorem prover created these proofs algorithmically and reports
them instruction by instruction. These results were then converted into a natural language equivalent.</p>
      </sec>
      <sec id="sec-3-2">
        <title>3.2. Natural Language Equivalents</title>
        <p>Obtaining the natural language equivalent of the dataset was also done algorithmically. This is done
by analyzing each problem’s structure in terms of components like axioms, conjectures, refutation
steps, inferences, and its label. Each component has its own sub-components acting as describable
attributes. These are essential for the automated theorem prover E to algorithmically solve given
problems. However, some were not necessary for obtaining the natural language equivalent of the
results; instead, only the core components used to achieve the result were retained. As an example,
a problem containing the axiom (b) ∧ (¬b ∨ ¬c) with conjecture (¬a ∨ b) ∧ (b ∨ c) consists of multiple
refutation steps, each of which could consist of a single inference or nested inferences. Its first refutation step
would have two inferences, where the first is to assume the negation of the
conjecture. The second inference would then simplify it. This produced the sub-result ¬((¬a ∨ b)
∧ (b ∨ c)), which the second refutation step takes in. This operation continues until the last refutation step
is reached. Throughout the entire process, the algorithm checks for particular keywords that indicate
what operations were performed at each step and the derived sub-results and final result. For each
check that is made, a natural language equivalent is created and stored.</p>
      </sec>
      <sec id="sec-3-3">
        <title>3.3. Large Language Model Fine Tuning</title>
        <p>
          The seven-billion-parameter instruction-tuned version of Qwen2 was chosen as the starting point
for fine-tuning. Qwen2 is a family of foundation models whose most complex version, with 72 billion
parameters, performs well on a wide variety of tasks [
          <xref ref-type="bibr" rid="ref15">15</xref>
          ]. Qwen2 was specifically chosen for this
purpose due to its open-source nature and its multiple versions with differing parameter sizes. Hardware
and computational limitations prevented the inclusion of larger models. However, should performance
improve on the smaller models, that would give evidence that further performance gains may be available
on larger instances of those same models.
        </p>
        <p>
          Low-Rank Adaptation (LoRA) of large language models is a technique for training models in such
a way that it decreases the number of trainable parameters through reparameterization [
          <xref ref-type="bibr" rid="ref16">16</xref>
          ]. This
allows larger models to be effectively trained on reduced hardware and is a type of parameter-efficient
fine-tuning [
          <xref ref-type="bibr" rid="ref17">17</xref>
          ]. LoRA was used here to efficiently fine-tune Qwen2-7B-Instruct on the dataset of proofs
converted to natural language. This was done using the Transformers library [
          <xref ref-type="bibr" rid="ref18">18</xref>
          ] and
LLaMA-Factory [
          <xref ref-type="bibr" rid="ref19">19</xref>
          ]. The model was trained for 3 epochs with a learning rate of 1.0 × 10⁻⁴, a batch size of 1, and
using PyTorch’s AdamW optimizer [
          <xref ref-type="bibr" rid="ref20">20</xref>
          ] with β = (0.9, 0.999) and ε = 1 × 10⁻⁸. When this was complete,
these models were considered fine-tuned and suitable for evaluation on the chosen benchmark tasks.
        </p>
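        <p>A back-of-the-envelope calculation shows why LoRA makes fine-tuning feasible on reduced hardware. The 4096 × 4096 projection size below is an assumption typical of 7B-class models, not a verified Qwen2 shape, and the ranks shown are illustrative.</p>

```python
# LoRA reparameterizes a frozen d_out x d_in weight W as W + (alpha / r) * B @ A,
# where B is d_out x r and A is r x d_in; only A and B are trained.
def lora_trainable_params(d_in, d_out, r):
    return r * (d_in + d_out)      # parameters in A and B only

def full_trainable_params(d_in, d_out):
    return d_in * d_out            # parameters in a full weight update

d = 4096  # assumed projection width, typical of a 7B-parameter model
for r in (8, 16, 64):
    frac = lora_trainable_params(d, d, r) / full_trainable_params(d, d)
    print(f"rank {r:3d}: {lora_trainable_params(d, d, r):>9,} params "
          f"({frac:.4%} of a full update)")
```

Even at rank 64, a single projection trains well under 4% of the parameters a full update would, which is what allows a 7B model to be adapted on modest hardware.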
      </sec>
    </sec>
    <sec id="sec-4">
      <title>4. Evaluation Benchmarks</title>
      <p>
        The performance of these fine-tuned models is compared against the performance of the unspecialized
language models on which they are based. Reasoning can be construed broadly, ranging from the direct
logic proof generation this paper used for training to mathematical reasoning, mechanical
reasoning, and spatial reasoning, among many others. To reflect this broad perspective, this paper's
benchmarks were chosen to cover a wide variety of reasoning tasks. These were the Grade School
Math 8K (GSM8K) [
        <xref ref-type="bibr" rid="ref21">21</xref>
        ], Massive Multitask Language Understanding-Redux (MMLU-Redux) [
        <xref ref-type="bibr" rid="ref22">22</xref>
        ], Code
Reasoning, Understanding, and eXecution Evaluation (CRUXEval) [
        <xref ref-type="bibr" rid="ref23">23</xref>
        ], and ZebraLogic [
        <xref ref-type="bibr" rid="ref24">24</xref>
        ] benchmark
tasks. ZeroEval [
        <xref ref-type="bibr" rid="ref25">25</xref>
        ] was used as the framework for evaluating the various models' performance on
these tasks. Two variations of the Qwen2 model were chosen to evaluate the performance of this
paper's approach; both are instruct versions, which are designed to follow prompted instructions. The
smaller version is Qwen2-1.5B-Instruct, which has roughly 1.5 billion parameters, and the larger version
is Qwen2-7B-Instruct, which has roughly 7 billion parameters. These were chosen due to hardware
constraints and are suitable for fine-tuning. This approach can scale to different language models or
larger models with more parameters without major alteration.
      </p>
      <sec id="sec-4-1">
        <title>4.1. Grade School Math 8K</title>
        <p>
          GSM8K is a collection of mathematical reasoning problems designed to be solvable with simple arithmetic
operations. Each problem takes two to eight steps to solve, and the problems are designed not to follow a template for
problem construction [
          <xref ref-type="bibr" rid="ref21">21</xref>
          ]. These problems also require an understanding of words and their meanings,
which differs from the purely symbolic fine-tuning data that was used for this paper's methodology.
        </p>
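        <p>For illustration, a GSM8K-style problem, invented here rather than drawn from the benchmark, together with its stepwise arithmetic solution:</p>

```python
# Invented GSM8K-style word problem: "A baker bakes 12 loaves a day for
# 5 days and sells 48 of them. How many loaves remain?"
baked = 12 * 5          # step 1: total loaves baked over the five days
remaining = baked - 48  # step 2: subtract the loaves sold
print(remaining)
```

A model must both parse the word problem and chain the two arithmetic steps, which is quite different from manipulating purely symbolic clauses.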
      </sec>
      <sec id="sec-4-2">
        <title>4.2. Massive Multitask Language Understanding-Redux</title>
        <p>
          MMLU-Redux is a curated subset of MMLU [
          <xref ref-type="bibr" rid="ref26">26</xref>
          ], which is a large dataset of questions that span a variety
of topics. MMLU-Redux includes 5,700 manually re-annotated problems covering 57 topic areas
including logical fallacies, formal logic, economics, history, and medicine [
          <xref ref-type="bibr" rid="ref22">22</xref>
          ]. This benchmark requires
the language model to internalize a large amount of background information while also including many
problems which require formal reasoning skills.
        </p>
      </sec>
      <sec id="sec-4-3">
        <title>4.3. Code Reasoning, Understanding, and eXecution Evaluation</title>
        <p>
          CRUXEval is a benchmark dataset made up of 800 Python functions and input-output pairs [
          <xref ref-type="bibr" rid="ref23">23</xref>
          ]. Each
problem either specifies the function and an input and asks the language model to produce the
correct output, or asks the model to determine what input would produce a specified output
from the function. Answering these questions indicates how well the language model understands and
reasons about programming constructs and the logic necessary to evaluate the provided functions.
        </p>
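        <p>A CRUXEval-style item, made up here for illustration, pairs a Python function with input-output questions in both directions:</p>

```python
# Invented CRUXEval-style function: double the odd elements of a list.
def f(xs):
    return [x * 2 for x in xs if x % 2 == 1]

# Output prediction: given the input, what does f return?
print(f([1, 2, 3]))
# Input prediction: exhibit an input for which f returns [10].
print(f([5]))
```

Answering either direction requires the model to mentally execute the list comprehension, including its filter condition.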
      </sec>
      <sec id="sec-4-4">
        <title>4.4. ZebraLogic</title>
        <p>
          ZebraLogic is a benchmark made up of constraint satisfaction problems [
          <xref ref-type="bibr" rid="ref24">24</xref>
          ]. The language model is
provided a set of houses and a set of owners. The model is also given constraints about which
house has which items, which people own which items, and information relating one item to
another house. The final goal is the matching of each house to each owner. These problems require
the language model to identify a complex web of facts and a large number of constraints, and to apply
logic to find the solution.
        </p>
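        <p>The structure of these puzzles can be illustrated with a toy three-house instance solved by brute-force search over permutations. The puzzle below is invented for this sketch and is far smaller than the benchmark's problems:</p>

```python
from itertools import permutations

# Toy ZebraLogic-style puzzle (invented): three houses in a row, three owners,
# three pets. Constraints: Alice lives in house 1; Bob owns the fish; the cat
# lives in house 3. Find the unique (owner, pet) assignment per house.
owners = ("alice", "bob", "carol")
pets = ("cat", "dog", "fish")

solutions = []
for owner_order in permutations(owners):
    for pet_order in permutations(pets):
        houses = list(zip(owner_order, pet_order))  # house i -> (owner, pet)
        if (houses[0][0] == "alice"                                  # constraint 1
                and any(o == "bob" and p == "fish" for o, p in houses)  # constraint 2
                and houses[2][1] == "cat"):                          # constraint 3
            solutions.append(houses)

print(solutions)
```

The benchmark's instances are much larger, so the model cannot enumerate assignments and must instead propagate the constraints deductively.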
      </sec>
    </sec>
    <sec id="sec-5">
      <title>5. Results</title>
      <p>Overall, using synthetically generated proofs by contradiction does not improve model performance.
To some degree the specialization can decrease performance for smaller models without substantially
improving performance for larger models. In each case, the number of non-answers is the number of
questions whose responses did not include sufficient information to check whether the answer is correct,
divided by the total number of questions. The accuracy is the total number of correctly answered
questions divided by the total number of questions. In the case of GSM8K the base models
outperformed this paper's fine-tuned models, by 1.97 for the larger models and by a difference as great
as 20.92 for the smaller models, as seen in Table 1.</p>
      <p>The number of No Answers is similar when comparing both versions of the models. For the
MMLU-Redux problems the fine-tuned model again underperformed the base models, with a slight difference of
only 0.54 for the larger models, as seen in Table 2. The number of No Answers is similarly close. For
the CRUXEval task the results are similar, with a difference of only 1 between the two larger models'
accuracy, as seen in Table 3. ZebraLogic shows a similar result, where the base model outperforms the
fine-tuned model by the small amount of 1.3, as seen in Table 4.</p>
    </sec>
    <sec id="sec-6">
      <title>6. Conclusion</title>
      <p>Learning propositional logic proofs based on contradiction does not sufficiently translate to overall
reasoning skills. This is not an entirely unexpected result, as many of the benchmark tasks require
the use of first-order logic, which is more sophisticated than the propositional logic used to train the
fine-tuned models here.</p>
    </sec>
    <sec id="sec-7">
      <title>Declaration on Generative AI</title>
      <p>The core methodology of this research involved the use of Qwen for fine-tuning and answering
reasoning-based question benchmarks, as described in the paper. No generative AI tools were
employed for the writing or editing of this manuscript beyond the research itself. The authors take full
responsibility for the publication's content.</p>
    </sec>
    <sec id="sec-8">
      <title>A. Training Data Example</title>
      <p>Problem: (b) ∧ (¬b ∨ ¬c) ⊨ (¬a ∨ b) ∧ (b ∨ c)
System Message: You are Qwen, created by Alibaba Cloud. You are a helpful assistant.
Input: Given axiom axio(b)(¬b|¬c), can you prove the conjecture conj(¬a|b)(b|c)?
Output: step 1) Begin by assuming the negation of the conjecture which is simplified to ¬(((¬a|b)∧(b|c))).
\nstep 2) Converting the result of step 1 to
negated normal form which is distributed to (((¬b|a)∧(¬c|a))∧((¬b|¬b)∧(¬c|¬b))). \nstep 3) Applying
simplification to the result of step 2 provides us with (b∧(¬b|¬c)). \nstep 4) Using split conjunction on
the result of step 3 produces (¬b|¬b). \nstep 5) Applying negated normal form to the result of step 4
renders (b∧(¬b|¬c)). \nstep 6) Now applying clause normalization to step 5 results in (¬b). \nstep 7)
Using split conjunction on the result of step 6 produces (b). \nstep 8) Given step 6 produced (¬b) and
step 7 (b), we have derived a contradiction.</p>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>
          [1]
          <string-name>
            <given-names>Y.</given-names>
            <surname>Wan</surname>
          </string-name>
          ,
          <string-name>
            <given-names>W.</given-names>
            <surname>Wang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Y.</given-names>
            <surname>Yang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Y.</given-names>
            <surname>Yuan</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.-t.</given-names>
            <surname>Huang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>P.</given-names>
            <surname>He</surname>
          </string-name>
          ,
          <string-name>
            <given-names>W.</given-names>
            <surname>Jiao</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M. R.</given-names>
            <surname>Lyu</surname>
          </string-name>
          ,
          <article-title>Logicasker: Evaluating and improving the logical reasoning ability of large language models</article-title>
          ,
          <source>arXiv preprint arXiv:2401.00757</source>
          (
          <year>2024</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>
          [2]
          <string-name>
            <given-names>P.</given-names>
            <surname>Han</surname>
          </string-name>
          ,
          <string-name>
            <given-names>P.</given-names>
            <surname>Song</surname>
          </string-name>
          ,
          <string-name>
            <given-names>H.</given-names>
            <surname>Yu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>You</surname>
          </string-name>
          ,
          <article-title>In-context learning may not elicit trustworthy reasoning: A-not-b errors in pretrained language models</article-title>
          ,
          <source>arXiv preprint arXiv:2409.15454</source>
          (
          <year>2024</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>
          [3]
          <string-name>
            <given-names>C.</given-names>
            <surname>Franks</surname>
          </string-name>
          , Propositional Logic, in: E. N.
          <string-name>
            <surname>Zalta</surname>
          </string-name>
          , U. Nodelman (Eds.),
          <source>The Stanford Encyclopedia of Philosophy</source>
          , Winter 2024 ed., Metaphysics Research Lab, Stanford University,
          <year>2024</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref4">
        <mixed-citation>
          [4]
          <string-name>
            <given-names>S.</given-names>
            <surname>Han</surname>
          </string-name>
          ,
          <string-name>
            <given-names>H.</given-names>
            <surname>Schoelkopf</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Y.</given-names>
            <surname>Zhao</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Z.</given-names>
            <surname>Qi</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Riddell</surname>
          </string-name>
          ,
          <string-name>
            <given-names>W.</given-names>
            <surname>Zhou</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Coady</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D.</given-names>
            <surname>Peng</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Y.</given-names>
            <surname>Qiao</surname>
          </string-name>
          ,
          <string-name>
            <given-names>L.</given-names>
            <surname>Benson</surname>
          </string-name>
          ,
          <string-name>
            <given-names>L.</given-names>
            <surname>Sun</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Wardle-Solano</surname>
          </string-name>
          ,
          <string-name>
            <given-names>H.</given-names>
            <surname>Szabó</surname>
          </string-name>
          ,
          <string-name>
            <given-names>E.</given-names>
            <surname>Zubova</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Burtell</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Fan</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Y.</given-names>
            <surname>Liu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>B.</given-names>
            <surname>Wong</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Sailor</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Ni</surname>
          </string-name>
          ,
          <string-name>
            <given-names>L.</given-names>
            <surname>Nan</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Kasai</surname>
          </string-name>
          ,
          <string-name>
            <given-names>T.</given-names>
            <surname>Yu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>R.</given-names>
            <surname>Zhang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Fabbri</surname>
          </string-name>
          ,
          <string-name>
            <given-names>W. M.</given-names>
            <surname>Kryscinski</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Yavuz</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Y.</given-names>
            <surname>Liu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>X. V.</given-names>
            <surname>Lin</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Joty</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Y.</given-names>
            <surname>Zhou</surname>
          </string-name>
          ,
          <string-name>
            <given-names>C.</given-names>
            <surname>Xiong</surname>
          </string-name>
          ,
          <string-name>
            <given-names>R.</given-names>
            <surname>Ying</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Cohan</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D.</given-names>
            <surname>Radev</surname>
          </string-name>
          ,
          <article-title>FOLIO: Natural language reasoning with first-order logic</article-title>
          , in:
          <string-name>
            <given-names>Y.</given-names>
            <surname>Al-Onaizan</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Bansal</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Y.-N.</given-names>
            <surname>Chen</surname>
          </string-name>
          (Eds.),
          <source>Proceedings of the 2024 Conference on Empirical Methods in Natural Language Processing</source>
          , Association for Computational Linguistics, Miami, Florida, USA,
          <year>2024</year>
          , pp.
          <fpage>22017</fpage>
          -
          <lpage>22031</lpage>
          . URL: https://aclanthology.org/2024.emnlp-main.1229/. doi:10.18653/v1/2024.emnlp-main.1229.
        </mixed-citation>
      </ref>
      <ref id="ref5">
        <mixed-citation>
          [5]
          <string-name>
            <given-names>Z.</given-names>
            <surname>Azerbayev</surname>
          </string-name>
          ,
          <string-name>
            <given-names>H.</given-names>
            <surname>Schoelkopf</surname>
          </string-name>
          ,
          <string-name>
            <given-names>K.</given-names>
            <surname>Paster</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M. D.</given-names>
            <surname>Santos</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S. M.</given-names>
            <surname>McAleer</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A. Q.</given-names>
            <surname>Jiang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Deng</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Biderman</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Welleck</surname>
          </string-name>
          , Llemma:
          <article-title>An open language model for mathematics</article-title>
          ,
          in:
          <source>The Twelfth International Conference on Learning Representations</source>
          ,
          <year>2024</year>
          . URL: https://openreview.net/forum?id=4WnqRR915j.
        </mixed-citation>
      </ref>
      <ref id="ref6">
        <mixed-citation>
          [6]
          <string-name>
            <given-names>S.</given-names>
            <surname>Polu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>I.</given-names>
            <surname>Sutskever</surname>
          </string-name>
          ,
          <article-title>Generative language modeling for automated theorem proving</article-title>
          ,
          <year>2020</year>
          . URL: https://arxiv.org/abs/2009.03393. arXiv:2009.03393.
        </mixed-citation>
      </ref>
      <ref id="ref7">
        <mixed-citation>
          [7]
          <string-name>
            <given-names>E.</given-names>
            <surname>First</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M. N.</given-names>
            <surname>Rabe</surname>
          </string-name>
          ,
          <string-name>
            <given-names>T.</given-names>
            <surname>Ringer</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Y.</given-names>
            <surname>Brun</surname>
          </string-name>
          ,
          <article-title>Baldur: Whole-proof generation and repair with large language models</article-title>
          ,
          <source>in: Proceedings of the 31st ACM Joint European Software Engineering Conference and Symposium on the Foundations of Software Engineering</source>
          , ESEC/FSE 2023,
          Association for Computing Machinery
          , New York, NY, USA,
          <year>2023</year>
          , pp.
          <fpage>1229</fpage>
          -
          <lpage>1241</lpage>
          . URL: https://doi.org/10.1145/3611643.3616243. doi:10.1145/3611643.3616243.
        </mixed-citation>
      </ref>
      <ref id="ref8">
        <mixed-citation>
          [8]
          <string-name>
            <given-names>T.</given-names>
            <surname>Nipkow</surname>
          </string-name>
          ,
          <string-name>
            <given-names>L. C.</given-names>
            <surname>Paulson</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Wenzel</surname>
          </string-name>
          ,
          <source>Isabelle/HOL - A Proof Assistant for Higher-Order Logic</source>
          , volume
          <volume>2283</volume>
          of LNCS
          , Springer,
          <year>2002</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref9">
        <mixed-citation>
          [9]
          <string-name>
            <given-names>A. Q.</given-names>
            <surname>Jiang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>W.</given-names>
            <surname>Li</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Tworkowski</surname>
          </string-name>
          ,
          <string-name>
            <given-names>K.</given-names>
            <surname>Czechowski</surname>
          </string-name>
          ,
          <string-name>
            <given-names>T.</given-names>
            <surname>Odrzygóźdź</surname>
          </string-name>
          ,
          <string-name>
            <given-names>P.</given-names>
            <surname>Miłoś</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Y.</given-names>
            <surname>Wu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Jamnik</surname>
          </string-name>
          , Thor:
          <article-title>Wielding hammers to integrate language models and automated theorem provers</article-title>
          ,
          <source>Advances in Neural Information Processing Systems</source>
          <volume>35</volume>
          (
          <year>2022</year>
          )
          <fpage>8360</fpage>
          -
          <lpage>8373</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref10">
        <mixed-citation>
          [10]
          <string-name>
            <given-names>A.</given-names>
            <surname>Singh</surname>
          </string-name>
          ,
          <string-name>
            <given-names>N. P.</given-names>
            <surname>Patel</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Ehtesham</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Kumar</surname>
          </string-name>
          ,
          <string-name>
            <given-names>T. T.</given-names>
            <surname>Khoei</surname>
          </string-name>
          ,
          <article-title>A survey of sustainability in large language models: Applications, economics, and challenges</article-title>
          ,
          <source>in: 2025 IEEE 15th Annual Computing and Communication Workshop and Conference (CCWC)</source>
          ,
          <year>2025</year>
          , pp.
          <fpage>00008</fpage>
          -
          <lpage>00014</lpage>
          . doi:10.1109/CCWC62904.2025.10903774.
        </mixed-citation>
      </ref>
      <ref id="ref11">
        <mixed-citation>
          [11]
          <string-name>
            <given-names>E.</given-names>
            <surname>Rich</surname>
          </string-name>
          ,
          <source>Automata, Computability and Complexity: Theory and Applications</source>
          , Pearson Prentice Hall,
          <year>2008</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref12">
        <mixed-citation>
          [12]
          <string-name>
            <given-names>G.</given-names>
            <surname>Sutcliffe</surname>
          </string-name>
          ,
          <article-title>The TPTP problem library and associated infrastructure</article-title>
          ,
          <source>J. Autom. Reason</source>
          .
          <volume>59</volume>
          (
          <year>2017</year>
          )
          <fpage>483</fpage>
          -
          <lpage>502</lpage>
          . URL: https://doi.org/10.1007/s10817-017-9407-7. doi:10.1007/s10817-017-9407-7.
        </mixed-citation>
      </ref>
      <ref id="ref13">
        <mixed-citation>
          [13]
          <string-name>
            <given-names>A.</given-names>
            <surname>Marchiafava</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Sen</surname>
          </string-name>
          ,
          <article-title>Towards determining how deep neural models learn to reason</article-title>
          ,
          <source>in: 2025 AAAI-MAKE Spring Symposium Series</source>
          , Association for the Advancement of Artificial Intelligence
          ,
          <year>2025</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref14">
        <mixed-citation>
          [14]
          <string-name>
            <given-names>S.</given-names>
            <surname>Schulz</surname>
          </string-name>
          ,
          <article-title>E - a brainiac theorem prover</article-title>
          ,
          <source>Journal of AI Communications</source>
          <volume>15</volume>
          (
          <year>2002</year>
          )
          <fpage>111</fpage>
          -
          <lpage>126</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref15">
        <mixed-citation>
          [15]
          <string-name>
            <given-names>A.</given-names>
            <surname>Yang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>B.</given-names>
            <surname>Yang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>B.</given-names>
            <surname>Hui</surname>
          </string-name>
          ,
          <string-name>
            <given-names>B.</given-names>
            <surname>Zheng</surname>
          </string-name>
          ,
          <string-name>
            <given-names>B.</given-names>
            <surname>Yu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>C.</given-names>
            <surname>Zhou</surname>
          </string-name>
          ,
          <string-name>
            <given-names>C.</given-names>
            <surname>Li</surname>
          </string-name>
          ,
          <string-name>
            <given-names>C.</given-names>
            <surname>Li</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D.</given-names>
            <surname>Liu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>F.</given-names>
            <surname>Huang</surname>
          </string-name>
          , G. Dong,
          <string-name>
            <given-names>H.</given-names>
            <surname>Wei</surname>
          </string-name>
          ,
          <string-name>
            <given-names>H.</given-names>
            <surname>Lin</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Tang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Wang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Yang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Tu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Zhang</surname>
          </string-name>
          , J. Ma, J. Xu,
          <string-name>
            <given-names>J.</given-names>
            <surname>Zhou</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Bai</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>He</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Lin</surname>
          </string-name>
          ,
          <string-name>
            <given-names>K.</given-names>
            <surname>Dang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>K.</given-names>
            <surname>Lu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>K.</given-names>
            <surname>Chen</surname>
          </string-name>
          ,
          <string-name>
            <given-names>K.</given-names>
            <surname>Yang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Li</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Xue</surname>
          </string-name>
          ,
          <string-name>
            <given-names>N.</given-names>
            <surname>Ni</surname>
          </string-name>
          ,
          <string-name>
            <given-names>P.</given-names>
            <surname>Zhang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>P.</given-names>
            <surname>Wang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>R.</given-names>
            <surname>Peng</surname>
          </string-name>
          ,
          <string-name>
            <given-names>R.</given-names>
            <surname>Men</surname>
          </string-name>
          ,
          <string-name>
            <given-names>R.</given-names>
            <surname>Gao</surname>
          </string-name>
          ,
          <string-name>
            <given-names>R.</given-names>
            <surname>Lin</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Wang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Bai</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Tan</surname>
          </string-name>
          ,
          <string-name>
            <given-names>T.</given-names>
            <surname>Zhu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>T.</given-names>
            <surname>Li</surname>
          </string-name>
          ,
          <string-name>
            <given-names>T.</given-names>
            <surname>Liu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>W.</given-names>
            <surname>Ge</surname>
          </string-name>
          ,
          <string-name>
            <given-names>X.</given-names>
            <surname>Deng</surname>
          </string-name>
          ,
          <string-name>
            <given-names>X.</given-names>
            <surname>Zhou</surname>
          </string-name>
          ,
          <string-name>
            <given-names>X.</given-names>
            <surname>Ren</surname>
          </string-name>
          ,
          <string-name>
            <given-names>X.</given-names>
            <surname>Zhang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>X.</given-names>
            <surname>Wei</surname>
          </string-name>
          ,
          <string-name>
            <given-names>X.</given-names>
            <surname>Ren</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Y.</given-names>
            <surname>Fan</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Y.</given-names>
            <surname>Yao</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Y.</given-names>
            <surname>Zhang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Y.</given-names>
            <surname>Wan</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Y.</given-names>
            <surname>Chu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Y.</given-names>
            <surname>Liu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Z.</given-names>
            <surname>Cui</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Z.</given-names>
            <surname>Zhang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Z.</given-names>
            <surname>Fan</surname>
          </string-name>
          ,
          <article-title>Qwen2 technical report</article-title>
          , arXiv preprint arXiv:2407.10671
          (
          <year>2024</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref16">
        <mixed-citation>
          [16]
          <string-name>
            <given-names>E. J.</given-names>
            <surname>Hu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Y.</given-names>
            <surname>Shen</surname>
          </string-name>
          ,
          <string-name>
            <given-names>P.</given-names>
            <surname>Wallis</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Z.</given-names>
            <surname>Allen-Zhu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Y.</given-names>
            <surname>Li</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Wang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>L.</given-names>
            <surname>Wang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>W.</given-names>
            <surname>Chen</surname>
          </string-name>
          ,
          <article-title>LoRA: Low-rank adaptation of large language models</article-title>
          ,
          <year>2021</year>
          . URL: https://arxiv.org/abs/2106.09685. arXiv:2106.09685.
        </mixed-citation>
      </ref>
      <ref id="ref17">
        <mixed-citation>
          [17]
          <string-name>
            <given-names>N.</given-names>
            <surname>Ding</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Y.</given-names>
            <surname>Qin</surname>
          </string-name>
          ,
          <string-name>
            <given-names>G.</given-names>
            <surname>Yang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>F.</given-names>
            <surname>Wei</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Z.</given-names>
            <surname>Yang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Y.</given-names>
            <surname>Su</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Hu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Y.</given-names>
            <surname>Chen</surname>
          </string-name>
          ,
          <string-name>
            <given-names>C.-M.</given-names>
            <surname>Chan</surname>
          </string-name>
          ,
          <string-name>
            <given-names>W.</given-names>
            <surname>Chen</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Yi</surname>
          </string-name>
          ,
          <string-name>
            <given-names>W.</given-names>
            <surname>Zhao</surname>
          </string-name>
          ,
          <string-name>
            <given-names>X.</given-names>
            <surname>Wang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Z.</given-names>
            <surname>Liu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>H.-T.</given-names>
            <surname>Zheng</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Chen</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Y.</given-names>
            <surname>Liu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Tang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Li</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Sun</surname>
          </string-name>
          ,
          <article-title>Parameter-efficient fine-tuning of large-scale pre-trained language models</article-title>
          ,
          <source>Nature Machine Intelligence</source>
          <volume>5</volume>
          (
          <year>2023</year>
          )
          <fpage>1</fpage>
          -
          <lpage>16</lpage>
          . doi:10.1038/s42256-023-00626-4.
        </mixed-citation>
      </ref>
      <ref id="ref18">
        <mixed-citation>
          [18]
          <string-name>
            <given-names>T.</given-names>
            <surname>Wolf</surname>
          </string-name>
          ,
          <string-name>
            <given-names>L.</given-names>
            <surname>Debut</surname>
          </string-name>
          ,
          <string-name>
            <given-names>V.</given-names>
            <surname>Sanh</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Chaumond</surname>
          </string-name>
          ,
          <string-name>
            <given-names>C.</given-names>
            <surname>Delangue</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Moi</surname>
          </string-name>
          ,
          <string-name>
            <given-names>P.</given-names>
            <surname>Cistac</surname>
          </string-name>
          ,
          <string-name>
            <given-names>T.</given-names>
            <surname>Rault</surname>
          </string-name>
          ,
          <string-name>
            <given-names>R.</given-names>
            <surname>Louf</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Funtowicz</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Davison</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Shleifer</surname>
          </string-name>
          , P. von Platen, C. Ma,
          <string-name>
            <given-names>Y.</given-names>
            <surname>Jernite</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Plu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>C.</given-names>
            <surname>Xu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>T. L.</given-names>
            <surname>Scao</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Gugger</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Drame</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Q.</given-names>
            <surname>Lhoest</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A. M.</given-names>
            <surname>Rush</surname>
          </string-name>
          ,
          <article-title>HuggingFace's Transformers: State-of-the-art natural language processing</article-title>
          ,
          <year>2020</year>
          . URL: https://arxiv.org/abs/1910.03771. arXiv:1910.03771.
        </mixed-citation>
      </ref>
      <ref id="ref19">
        <mixed-citation>
          [19]
          <string-name>
            <given-names>Y.</given-names>
            <surname>Zheng</surname>
          </string-name>
          ,
          <string-name>
            <given-names>R.</given-names>
            <surname>Zhang</surname>
          </string-name>
          , J. Zhang,
          <string-name>
            <given-names>Y.</given-names>
            <surname>Ye</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Z.</given-names>
            <surname>Luo</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Z.</given-names>
            <surname>Feng</surname>
          </string-name>
          , Y. Ma,
          <article-title>LlamaFactory: Unified efficient fine-tuning of 100+ language models</article-title>
          ,
          <source>in: Proceedings of the 62nd Annual Meeting of the Association for Computational Linguistics (Volume 3: System Demonstrations)</source>
          ,
          Association for Computational Linguistics
          , Bangkok, Thailand,
          <year>2024</year>
          . URL: http://arxiv.org/abs/2403.13372.
        </mixed-citation>
      </ref>
      <ref id="ref20">
        <mixed-citation>
          [20]
          <string-name>
            <given-names>A.</given-names>
            <surname>Paszke</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Gross</surname>
          </string-name>
          ,
          <string-name>
            <given-names>F.</given-names>
            <surname>Massa</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Lerer</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Bradbury</surname>
          </string-name>
          , G. Chanan,
          <string-name>
            <given-names>T.</given-names>
            <surname>Killeen</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Z.</given-names>
            <surname>Lin</surname>
          </string-name>
          ,
          <string-name>
            <given-names>N.</given-names>
            <surname>Gimelshein</surname>
          </string-name>
          ,
          <string-name>
            <given-names>L.</given-names>
            <surname>Antiga</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Desmaison</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Köpf</surname>
          </string-name>
          ,
          <string-name>
            <given-names>E.</given-names>
            <surname>Yang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Z.</given-names>
            <surname>DeVito</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Raison</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Tejani</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Chilamkurthy</surname>
          </string-name>
          ,
          <string-name>
            <given-names>B.</given-names>
            <surname>Steiner</surname>
          </string-name>
          ,
          <string-name>
            <given-names>L.</given-names>
            <surname>Fang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Bai</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Chintala</surname>
          </string-name>
          ,
          <article-title>PyTorch: An imperative style, high-performance deep learning library</article-title>
          ,
          <year>2019</year>
          . URL: https://arxiv.org/abs/1912.01703. arXiv:1912.01703.
        </mixed-citation>
      </ref>
      <ref id="ref21">
        <mixed-citation>
          [21]
          <string-name>
            <given-names>K.</given-names>
            <surname>Cobbe</surname>
          </string-name>
          ,
          <string-name>
            <given-names>V.</given-names>
            <surname>Kosaraju</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Bavarian</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Chen</surname>
          </string-name>
          ,
          <string-name>
            <given-names>H.</given-names>
            <surname>Jun</surname>
          </string-name>
          ,
          <string-name>
            <given-names>L.</given-names>
            <surname>Kaiser</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Plappert</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Tworek</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Hilton</surname>
          </string-name>
          ,
          <string-name>
            <given-names>R.</given-names>
            <surname>Nakano</surname>
          </string-name>
          ,
          <string-name>
            <given-names>C.</given-names>
            <surname>Hesse</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Schulman</surname>
          </string-name>
          ,
          <article-title>Training verifiers to solve math word problems</article-title>
          ,
          <year>2021</year>
          . URL: https://arxiv.org/abs/2110.14168. arXiv:2110.14168.
        </mixed-citation>
      </ref>
      <ref id="ref22">
        <mixed-citation>
          [22]
          <string-name>
            <given-names>A. P.</given-names>
            <surname>Gema</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J. O. J.</given-names>
            <surname>Leang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>G.</given-names>
            <surname>Hong</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Devoto</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A. C. M.</given-names>
            <surname>Mancino</surname>
          </string-name>
          ,
          <string-name>
            <given-names>R.</given-names>
            <surname>Saxena</surname>
          </string-name>
          ,
          <string-name>
            <given-names>X.</given-names>
            <surname>He</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Y.</given-names>
            <surname>Zhao</surname>
          </string-name>
          ,
          <string-name>
            <given-names>X.</given-names>
            <surname>Du</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M. R.</given-names>
            <surname>Ghasemi Madani</surname>
          </string-name>
          ,
          <string-name>
            <given-names>C.</given-names>
            <surname>Barale</surname>
          </string-name>
          ,
          <string-name>
            <given-names>R.</given-names>
            <surname>McHardy</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Harris</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Kaddour</surname>
          </string-name>
          ,
          <string-name>
            <given-names>E.</given-names>
            <surname>van Krieken</surname>
          </string-name>
          ,
          <string-name>
            <given-names>P.</given-names>
            <surname>Minervini</surname>
          </string-name>
          ,
          <article-title>Are we done with MMLU?</article-title>
          , in:
          <string-name>
            <given-names>L.</given-names>
            <surname>Chiruzzo</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Ritter</surname>
          </string-name>
          ,
          <string-name>
            <given-names>L.</given-names>
            <surname>Wang</surname>
          </string-name>
          (Eds.),
          <source>Proceedings of the 2025 Conference of the Nations of the Americas Chapter of the Association for Computational Linguistics: Human Language Technologies (Volume 1: Long Papers)</source>
          , Association for Computational Linguistics, Albuquerque, New Mexico,
          <year>2025</year>
          , pp.
          <fpage>5069</fpage>
          -
          <lpage>5096</lpage>
          . URL: https://aclanthology.org/2025.naacl-long.262/.
        </mixed-citation>
      </ref>
      <ref id="ref23">
        <mixed-citation>
          [23]
          <string-name>
            <given-names>A.</given-names>
            <surname>Gu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>B.</given-names>
            <surname>Rozière</surname>
          </string-name>
          ,
          <string-name>
            <given-names>H.</given-names>
            <surname>Leather</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Solar-Lezama</surname>
          </string-name>
          ,
          <string-name>
            <given-names>G.</given-names>
            <surname>Synnaeve</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S. I.</given-names>
            <surname>Wang</surname>
          </string-name>
          ,
          <article-title>Cruxeval: A benchmark for code reasoning, understanding and execution</article-title>
          ,
          <year>2024</year>
          . URL: https://arxiv.org/abs/2401.03065. arXiv:2401.03065
          .
        </mixed-citation>
      </ref>
      <ref id="ref24">
        <mixed-citation>
          [24]
          <string-name>
            <given-names>B. Y.</given-names>
            <surname>Lin</surname>
          </string-name>
          ,
          <string-name>
            <given-names>R. L.</given-names>
            <surname>Bras</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Y.</given-names>
            <surname>Choi</surname>
          </string-name>
          ,
          <article-title>Zebralogic: Benchmarking the logical reasoning ability of language models</article-title>
          ,
          <year>2024</year>
          . URL: https://huggingface.co/spaces/allenai/ZebraLogic.
        </mixed-citation>
      </ref>
      <ref id="ref25">
        <mixed-citation>
          [25]
          <string-name>
            <given-names>B. Y.</given-names>
            <surname>Lin</surname>
          </string-name>
          ,
          <source>ZeroEval: A Unified Framework for Evaluating Language Models</source>
          ,
          <year>2024</year>
          . URL: https://github.com/WildEval/ZeroEval.
        </mixed-citation>
      </ref>
      <ref id="ref26">
        <mixed-citation>
          [26]
          <string-name>
            <given-names>D.</given-names>
            <surname>Hendrycks</surname>
          </string-name>
          ,
          <string-name>
            <given-names>C.</given-names>
            <surname>Burns</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Basart</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Zou</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Mazeika</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D.</given-names>
            <surname>Song</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Steinhardt</surname>
          </string-name>
          ,
          <article-title>Measuring massive multitask language understanding</article-title>
          ,
          <year>2021</year>
          . URL: https://arxiv.org/abs/2009.03300. arXiv:2009.03300.
        </mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>