<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta />
    <article-meta>
      <title-group>
        <article-title>ReWiSe: Relation-Wise Self-consistency for LLM Probing</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Edouard Albert-Roulhac</string-name>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Amal Zouaq</string-name>
        </contrib>
        <contrib contrib-type="author">
          <string-name>LAMA-WeST Lab</string-name>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Polytechnique Montréal</string-name>
        </contrib>
      </contrib-group>
      <abstract>
        <p>Large Language Models (LLMs) learn facts and general knowledge from unstructured data. Evaluating the knowledge within an LLM and extracting it from its parametric memory are challenging tasks as models hallucinate and use a stochastic generation process. The LM-KBC competition challenges participants to propose an approach to construct a Knowledge Graph with an LLM and no external corpus. In this work, we propose a new approach called ReWiSe which uses chain-of-thought reasoning and relation-wise self-consistency. We create a synthetic chain-of-thought dataset with reasoning paths designed for the limited set of relations in the challenge. This synthetic dataset is then used as few-shot samples to make predictions. Chain-of-Thought reasoning provides gains for relations such as countryLandBordersCountry, where structured strategies (e.g., geographic enumeration) guide the model toward more complete answers. We finally propose to adapt self-consistency with a relation-wise approach that adapts to relation cardinality and schema. Our results show that relation-wise self-consistency leads to strong performance gains on the LM-KBC benchmark. Using 20 sampled generations, ReWiSe won the 2025 edition of LM-KBC with a Macro-F1 score of 44% (the baseline is 21%). The implementation is available at https://github.com/Lama-West/ReWiSe.</p>
      </abstract>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>1. Introduction</title>
      <p>
        The LM-KBC challenge. The objective of the Language Models - Knowledge Base Construction
(LM-KBC) challenge [
        <xref ref-type="bibr" rid="ref6">6</xref>
        ] is to use a LLM to complete triples in a Wikidata-based KG. Wikidata is a free
online multilingual collaborative KG. It is general-purpose and contains over 115 billion triples [
        <xref ref-type="bibr" rid="ref18 ref7">7</xref>
        ].
Given a head entity and a relation (ℎ, ), the task is to use an LLM to output the list of all tail entities 
such that (ℎ, , ) is in the KG. Triples can have 0, 1 or many tail entities.
      </p>
      <p>In the 2025 edition, model fine-tuning and using external corpora with RAG are disallowed. The
Qwen3-8B model is imposed.</p>
      <p>Evaluation is done with Macro-F1 score. A 5% margin is allowed for numerical entities. String matching
is used for non-numerical entities.</p>
      <p>Relations. The dataset for the 2025 edition is a subset of Wikidata. It features 6 relations. Each one
only appears 10 to 100 times in both the training and the validation splits of the dataset. The relations
are as follows:
• countryLandBordersCountry (68 input pairs per split): This list can have 0, 1 or multiple
elements.
• personHasCityOfDeath (100 input pairs per split): Can be empty.
• hasCapacity (100 input pairs per split): Numeric (integer) answer.
• hasArea (100 input pairs per split): Numeric (real number) answer.
• awardWonBy (10 input pairs per split): The list can be long (up to 224).</p>
      <p>• companyTradesAtStockExchange (100 input pairs per split): Can be empty.</p>
      <p>This setting presents several challenges. First, LLM outputs are stochastic and sensitive to prompt
phrasing, making it hard to reliably extract structured knowledge from a single query. Second, prior
work shows that reasoning via intermediate steps can improve factual accuracy, but constructing
such reasoning paths at scale remains non-trivial. Finally, relations in a knowledge graph follow a
schema—ranging from single entities to long lists or numerical values—which complicates aggregation
across generations. In this work, we explore the following research questions:
1. Can we teach a systematic way to retrieve parametric knowledge through
Chain-ofThought?
We build HumanCoT which contains examples of systematic Chain-of-Thought reasoning for the
model to reproduce the same kind of reasoning during inference.
2. How can synthetic data generation be leveraged in data-scarce settings for LLM probing?
We generate synthetic reasoning paths that lead to the correct answer for every example of the
training set and use them in independent predictions with the self-consistency process.
3. How can we adapt self-consistency to relation schema in a Knowledge Graph, especially
when there is no assumption on the cardinality of the relation?
We aggregate the outputs of several model calls at the entity level. When there is no assumption
on the cardinality of the relation, we introduce a threshold which represents the confidence of
the model in predicting each entity. This enables to tune the aggregation strategy relation-wise.</p>
    </sec>
    <sec id="sec-2">
      <title>2. Related Work</title>
      <p>
        Several methods that tackle KG completion rely on accessing external knowledge at inference time [
        <xref ref-type="bibr" rid="ref8">8</xref>
        ]
or models trained on specific knowledge with the same entities and relations as those used at inference
time [
        <xref ref-type="bibr" rid="ref9">9</xref>
        ]. These methods and similar ones listed in [
        <xref ref-type="bibr" rid="ref10">10</xref>
        ] are not allowed in this edition of the challenge
which is about extracting the knowledge from the parametric memory of the LLM itself, not using the
LLM to process information retrieved in an existing database.
      </p>
      <p>
        Prompt-based methods attempt to use natural language to query the model, associating prompts to
relations following the approach of [
        <xref ref-type="bibr" rid="ref11">11</xref>
        ]. These prompts can be templates for masked language models
or questions for auto-regressive models. The search for an optimal prompt is a key challenge in this
context. AutoPrompt [
        <xref ref-type="bibr" rid="ref12">12</xref>
        ] optimizes text prompts for this purpose. The optimization of prompts can
also be done with soft tokens that are trained through gradient descent to retrieve information as in
Prefix Tuning [
        <xref ref-type="bibr" rid="ref13">13</xref>
        ]. This is not allowed in the challenge as it is a fine-tuning approach. Some work has
also been done to find optimal few-shot examples to provide to the models [
        <xref ref-type="bibr" rid="ref14">14</xref>
        ].
      </p>
      <p>
        A motivation for the search of a good prompt is the fact that similar prompts may not produce similar
answers [
        <xref ref-type="bibr" rid="ref1">1</xref>
        ]. This is still an open research problem.
      </p>
      <p>
        Chain-of-Thought reasoning enables the model to perform intermediate reasoning steps before
producing a final answer, efectively trading computational cost for improved answer quality [
        <xref ref-type="bibr" rid="ref15">15</xref>
        ].
The intermediate context generated by the model functions as a form of working memory. Recent
models—particularly Qwen3 [
        <xref ref-type="bibr" rid="ref16">16</xref>
        ]—are explicitly trained to respond using Chain-of-Thought reasoning.
This approach is valuable as it allows the model to break down complex problems into steps and to
externalize contextual knowledge stored in its parametric memory via this working memory [
        <xref ref-type="bibr" rid="ref17">17</xref>
        ], which
can aid in arriving at a more accurate final answer. However, generating reasoning paths at scale is an
open problem. We address this by generating a synthetic dataset tailored to each relation.
Ensembling and consistency methods leverage multiple prompts and generation attempts from
large language models (LLMs) to improve robustness to prompt variability. For example, [18] generate
paraphrased versions of an initial prompt to diversify inputs. Self-evaluation techniques allow the
model to assess and either confirm or refute its previous outputs [19].
      </p>
      <p>Similarly, Self-Consistency [20] builds on the inherent stochasticity of text generation. It produces
multiple reasoning paths and answers to a given question, then selects the most consistent answer
using majority voting. This approach assumes a single correct answer—typically a discrete choice in a
multiple-choice question or a numerical value. Universal Self-Consistency [21] extends this idea by
proposing a more general framework capable of handling a broader range of answer types. Instead of
relying on majority voting, it prompts an LLM to select the most consistent answer from among the
generated candidates, allowing plain-text answers to be considered valid.</p>
      <p>However, KG completion introduces a more structured setting than general QA, which must be taken
into account. Each relation in a KG follows a specific schema. Some relations expect numerical outputs,
while others require lists of object entities—sometimes with cardinality constraints. Exploiting this
underlying structure is essential for efective answer aggregation in this context, which is what we
propose in this work. To the best of our knowledge, self-consistency has not previously been applied to
structured prediction tasks like KG completion.</p>
      <p>Contributions. In order to leverage the full training set and the reasoning potential of the LLM, we
propose to use a synthetic data generation process, in which we automatically generate reasoning paths
leading to a correct answer. This synthetic dataset is then used to teach the model how to complete
triples based on the relation involved. In addition, we propose a relation-wise version of self-consistency.
It exploits the stochasticity of generative models and adapts to relations’ schema.</p>
    </sec>
    <sec id="sec-3">
      <title>3. Methodology</title>
      <sec id="sec-3-1">
        <title>3.1. Baseline</title>
        <p>The baseline presented by the competition uses few-shot prompting. Each relation is associated to a
question and 5 examples of questions and answers from the training set are provided before asking a
question from the test set. The model generates a list of answers directly. The baseline’s performance is
shown in Table 1. It reaches a Macro-F1 score of 20.0%.</p>
        <p>Question Prompting, using a question which expects the tail entities as answer.</p>
      </sec>
      <sec id="sec-3-2">
        <title>3.2. Relation-Wise Self-consistency</title>
        <p>We propose a new approach called Relation-Wise Self-consistency (ReWiSe) that combines synthetic
reasoning paths with entity-level aggregation of predictions. ReWiSe proceeds through four main steps
summarized below.</p>
        <p>HumanCoT. We manually craft reasoning paths, called HumanCoT, for a few triples per relation.
Each reasoning path is free-form text which guides the model toward producing the correct answer.
The goal is that the model copies the reasoning strategies from HumanCoT in the inference process.
SyntheticCoT. Using HumanCoT as few-shot examples, we generate synthetic reasoning paths and
answers for other training triples. Only outputs leading to correct answers are kept.
Inference Generation. For each test triple, we generate multiple reasoning paths and corresponding
answers. This ensembling, known as Self-Consistency, addresses stochasticity in both model generation
and few-shot example selection.</p>
        <p>Relation-Wise Aggregation. Finally, we discard the reasoning paths and aggregate answers at the
entity level. Depending on the schema of the relation, diferent strategies are tested on the validation
set and the best one is kept for each relation.</p>
        <p>These steps are detailed in the following sections.</p>
      </sec>
      <sec id="sec-3-3">
        <title>3.3. HumanCoT dataset</title>
        <p>In order to impose a systematic way to reason about each relation, we need a dataset with correct
reasoning paths associated to each triple. Instead of writing them for every triple of the training set
- which is not scalable - we write only a few examples per relation in a HumanCoT dataset. In total,
the HumanCoT dataset contains 20 reasoning examples across the six relations. These examples are
used in few-shot prompting settings to create a SyntheticCoT dataset. The dataset is available in our
repository1.</p>
        <p>The reasoning path starts with a &lt;think&gt; tag and ends with a &lt;/think&gt; tag allowing for easy parsing of
the generated text. We will denote  the reasoning path and  the answer.</p>
        <p>Here is, for each relation, the systematic reasoning that was used to build HumanCoT dataset. By using
examples from HumanCoT to build SyntheticCoT, and then using SyntheticCoT examples for inference,
we aim to propagate the way of reasoning about relations consistently throughout the pipeline.
We rely on heuristics and empirical tests to build the reasoning paths. It should start with a short
definition of the subject entity and general knowledge about it. The model should identify critical
knowledge to answer the question [22]. Triples included in HumanCoT are chosen to represent the
schema of each relation. For numeric relations, HumanCoT triples represent the range of possible
outputs. If a relation cannot have a correct answer, a HumanCoT triple will have no correct answer.
Overall, two groups of relations emerge: those that could be addressed through reasoning steps to guide
the model, and those that relied on factual knowledge and could not necessarily be resolved through
reasoning alone.</p>
        <p>The first group includes countryLandBordersCountry, personHasCityOfDeath and
awardWonBy.
countryLandBordersCountry. The strategy adopted here is to reason geographically by checking
land neighbors around the clock. For this relation, we provide 4 examples of human reasoning paths.
Example: Which countries share a land border with Kenya?
Chain-of-Thought &lt;think&gt; Kenya is a country located in East Africa, and its eastern side is bordered
by the Indian Ocean. Therefore, Kenya has no land borders on its eastern side. To determine its land
borders, we need to consider neighboring countries in the region. First, looking to the north, Ethiopia
is directly adjacent to Kenya, making it a clear land neighbor. Moving northeast, Somalia lies next to
Kenya and extends along the eastern Horn of Africa. To the west, Kenya shares a border with Uganda.
While the Democratic Republic of the Congo is nearby, it does not border Kenya directly because
Uganda lies between them. To the northwest, Kenya shares a small border with South Sudan but it does
not touch Sudan, near the Ilemi Triangle region. Finally, to the south, Kenya is bordered by Tanzania,
with the border extending along the Serengeti-Maasai Mara area. &lt;/think&gt; Ethiopia, Somalia, South
Sudan, Tanzania, Uganda.
personHasCityOfDeath. For this relation, the strategy is first to determine whether the person has
died, then to retrieve the city of death. A frequent LLM error is to output a city where the person lived
rather than died. Importantly, death is definitive: if the model has seen evidence of a person’s death, it
can remember it, whereas mentions of someone being alive don’t guarantee they’re still living.
For this relation, we provide 4 examples of human reasoning paths.</p>
        <p>Example: In which city did Ricky Nelson die?
Chain-of-Thought &lt;think&gt; Ricky Nelson was an American musician and actor. He is known for his
country rock style. He lived in Los Angeles but died in an aircraft crash near Dallas, Texas. &lt;/think&gt;
Dallas
1https://github.com/Lama-West/ReWiSe/blob/main/data/dataset2025/cot/human_cot_single.csv
awardWonBy. This relation is challenging as the number of ground truth outputs can be large.
Empirically, we found that generating the whole list of outputs is too hard. Instead, our strategy is
to generate coherent subgroups of award winners, for example by grouping them by year of award,
gender, or achievement. The model has to identify a subgroup and generate recipients’ names. When
the model generates multiple subgroups, they can later be aggregated to reconstruct the full set of
outputs as detailed in the inference section.</p>
        <p>For this relation, we provide 2 reasoning paths for each of 2 subject entities.</p>
        <p>Example: Who won the Fields Medal? Subgroups: 1990 recipients and female recipients.
Chain-of-Thought for 1990 recipients: &lt;think&gt; The Fields Medal is one of the most prestigious
awards in mathematics, given every four years to individuals under the age of 40 who have made
outstanding contributions to the field. These are Fields Medal recipients from 1990: Vladimir Drinfeld,
Vaughan Jones, Shigefumi Mori, and Edward Witten. &lt;/think&gt; Vladimir Drinfeld, Vaughan Jones,
Shigefumi Mori, Edward Witten.</p>
        <p>For the last three relations - hasCapacity, hasArea and companyTradesAtStockExchange
we were unable to devise a strategy, as the triple completion task does not lend itself to reasoning-based
methods. Instead, the goal is for the model to draw upon its stored knowledge about the subject entity
with the hope that this will guide it towards the right answer. We show examples of CoTs used for these
three relations below:
hasCapacity. For this relation, we provide 3 examples of human reasoning paths.
Example: What is the total capacity of Gaddafi Stadium?
Chain-of-Thought &lt;think&gt; Gaddafi Stadium is a major cricket stadium located in Lahore, Pakistan.
Named after former Libyan leader Muammar Gaddafi, it is one of the largest cricket stadiums in Pakistan
and serves as the home ground for the Pakistan national cricket team. The stadium has undergone
several renovations and expansions since its construction, with its current capacity being designed
to accommodate 60,000 spectators, making it one of the largest cricket venues in the world. &lt;/think&gt;
60000
hasArea. For this relation, we provide 3 examples of human reasoning paths.</p>
        <p>Example: What is the area of Heimaey in square kilometers?
Chain-of-Thought &lt;think&gt; Heimaey is the largest and only populated island in the Westman Islands
(Vestmannaeyjar) archipelago, located of the southern coast of Iceland. Despite its small size, Heimaey
is famous for being geologically active. The island has a total area of 13.4 square kilometers. &lt;/think&gt;
13.4
companyTradesAtStockExchange. For this relation, we provide 3 examples of human reasoning
paths.</p>
        <p>Example: Where do shares of Novadios trade?
Chain-of-Thought &lt;think&gt; Novadios provides legal process outsourcing services. It is based in Los
Angeles, California and operates in Argentina but it is privately held. &lt;/think&gt; None</p>
      </sec>
      <sec id="sec-3-4">
        <title>3.4. SyntheticCoT Dataset</title>
        <p>3.4.1. Synthetic generation
The SyntheticCoT dataset is generated for triples from the training set using few-shot prompting with
examples from HumanCoT. We use 2-shot prompting and Llama 3.3 70B Instruct as a model to distill
reasoning ability from a larger model into synthetic examples. When attempting this with smaller
models, we encountered frequent formatting errors, leading to an insuficiently sized SyntheticCoT
dataset.</p>
        <p>Given a triple from the training set and which does not appear in the HumanCoT dataset, this process
can be repeated multiple times. Because of the stochasticity of both the generation process and the
choice of few-shot examples, the reasoning paths and answers might difer. Repeating the process more
times will raise the number of correct answers and increase the diversity of reasoning paths. We repeat
it 10 times in our experiments. Raising this hyperparameter will raise the size of SyntheticCoT.
Example from the SyntheticCoT Dataset
1 System (instructions)
2 Answer the question after a brief chain of reasoning. First,
3 write your reasoning between &lt;think&gt; and &lt;/think&gt; tags. Then,
4 directly list the answers, separated by commas. If there’s no
5 answer, type "None". Be concise.
6 User (few-shot example from HumanCoT)
7 In which city did Jan Turski die?
8 Assistant (few-shot example from HumanCoT)
9 &lt;think&gt; Jan Turski was a Polish politician and diplomat. He was
10 born in 1940 and died in Warsaw in 2016. &lt;/think&gt; Warsaw
11 User (few-shot example from HumanCoT)
12 In which city did Konrad Rufus Müller die?
13 Assistant (few-shot example from HumanCoT)
14 &lt;think&gt; Konrad Rufus Müller is a German photographer renowned
15 for his black-and-white portraits of German politicians until
16 Angela Merkel. I don’t have information suggesting that he has
17 passed away.
18 &lt;/think&gt; None
19 User (new question)
20 In which city did Jeroen Brouwers die?
21 Assistant (generated, will be added to SyntheticCoT)
22 &lt;think&gt; Jeroen Brouwers was a Dutch journalist, writer and
23 critic. He passed away in 2022 in Maastricht or possibly in a
24 different city in the region, but Maastricht is a known
25 location associated with him. &lt;/think&gt; Maastricht</p>
        <sec id="sec-3-4-1">
          <title>3.4.2. Filtering</title>
          <p>Correctness of answers. After generating reasoning paths and answers, we discard the reasoning
paths and keep only the answers. We then classify these as correct, incorrect, or incomplete. An
answer is correct if it exactly matches the ground truth list. It is incomplete if it is non-empty and
strictly contained within the ground truth, which is especially relevant for relations like awardWonBy
where the true list is often long and hard to fully reproduce. Incomplete answers still carry valuable
information and are retained. Numerical answers are accepted if they fall within a 5% margin of the
ground truth, in line with the competition’s evaluation protocol. All other outputs are filtered out.
This filtering follows the heuristic that the reasoning is most likely correct if it leads to the correct
answer.</p>
          <p>The generation process yields 4,590 reasoning paths in total, of which 2878 are kept, 2794 are correct
and 84 are incomplete. See Annex, Table 4 for per-relation details. The dataset is available in our
repository2.
2https://github.com/Lama-West/ReWiSe/blob/main/data/dataset2025/cot/nohelp_synthetic_cot/clean/new_synthetic_
llama_correct_incomplete.csv</p>
        </sec>
      </sec>
      <sec id="sec-3-5">
        <title>3.5. Inference on the test set</title>
        <p>Experiments with zero-shot prompting revealed that the model often failed to follow the instructions
correctly, highlighting the need for examples. In addition, when using few-shot prompting, the model
should copy the reasoning strategy introduced in HumanCoT. This approach also helps establish the
expected format, which is essential for processing the generated answers.</p>
        <p>For the generation step, we use a generative large language model LLM and few-shot prompting on
SyntheticCoT.  is our few-shot parameter. Let’s note  the number of sampled reasoning paths. 
can be viewed as an ensembling method parameter. Raising it diminishes the risk associated to relying
on a single sample. The aggregated prediction is based on a larger set of independent answers.</p>
        <p>Given a head entity ℎ from the test set and the associated relation , we do these actions  times:
1. Select randomly  elements from SyntheticCoT with relation 
2. Build a few-shot prompt with these elements and (ℎ, )
3. Sample from LLM to get (, )
This process yields  outputs. Let’s note them ( , )1≤ ≤  .</p>
        <p>Inference uses  = 5 random few-shot examples from SyntheticCoT that share the same
relation
1 System (instructions)
2 Answer the question after a brief chain of reasoning. First,
3 write your reasoning between &lt;think&gt; and &lt;/think&gt; tags. Then,
4 directly list the answers, separated by commas. If there’s no
5 answer, type "None". Be concise.
6 User (example 1, from SyntheticCoT)
7 In which city did Milan Lasica die?
8 Assistant (example 1, from SyntheticCoT)
9 &lt;think&gt; Milan Lasica was a Slovak actor, writer, and dramatist.
10 He died in 2021 in Bratislava. &lt;/think&gt; Bratislava
11 ...
12 User (question from test set)
13 In which city did Jeroen Brouwers die?
14 Assistant</p>
      </sec>
      <sec id="sec-3-6">
        <title>3.6. Relation-wise self-consistency</title>
        <p>Once the answers have been generated, we can discard the reasoning paths and move on to
aggregation—this constitutes the second step of self-consistency. Standard self-consistency aggregates
answers () using majority voting. In our setup, however, each answer is a list of object entities
 = [,1, . . . , , ], allowing us to perform aggregation at the entity level rather than at the answer
level.</p>
        <p>For one-to-many and numerical relations, we define specific aggregation methods in the next
sections. For other relations expecting a single entity, such as personHasCityOfDeath, we simply apply
majority voting over the predicted entities, including "None" as a possible outcome.</p>
        <sec id="sec-3-6-1">
          <title>3.6.1. Self-consistency for one-to-many relations</title>
          <p>For relations that have no assumption on the length of the prediction, we define the consistency
threshold  ∈ (0, 1). The final prediction is the concatenation of entities which appear in at least  × 
answers. Raising  means that only tail entities consistently predicted by the model are part of the final
prediction.</p>
          <p>Instead of using an arbitrary value, we use a threshold   for each relation  on the validation set. This
process allows for an automatic specialization of the aggregation process on relations.</p>
          <p>For example, let’s consider the triple ( USA, countryLandBordersCountry, [Canada, Mexico] ) with
 = 4. This means that we use the inference model to generate four independent reasoning paths and
answers. The answers are [Canada, Mexico], [Mexico], [Canada], [Canada, Panama]. Majority voting
at the answer level gives a tie as no list appears twice. With a threshold  = 0.5 we keep all entities
which appear in at least 2 independent answers. The aggregated answer is [Canada, Mexico].</p>
        </sec>
        <sec id="sec-3-6-2">
          <title>3.6.2. Self-consistency for numerical relations</title>
          <p>In the case of numerical relations, we can assume that every answer () will be a numerical singleton
[]. Aggregation of all answers can be made with majority voting or with numerical aggregation
methods such as using the median or the average.</p>
          <p>The choice between these three aggregation methods is made relation-wise on the validation set.</p>
        </sec>
      </sec>
    </sec>
    <sec id="sec-4">
      <title>4. Experiments</title>
      <p>All experiments were conducted using Qwen3-8B. For few-shot prompting, we used 5 diferent examples
per relation, sampled randomly from either the training set (without CoT) or from SyntheticCoT (with
CoT). When Chain-of-Thought (CoT) is enabled, the model is explicitly instructed to produce reasoning
enclosed between think tags before generating the final answer. We use self-consistency parameter
values of 1 (no self-consistency) and 20 (with self-consistency). We first find the optimal aggregation
strategy on the validation set and then apply it to the test set and report the results per relation on the
test and validation sets respectively in Tables 2 and 3.</p>
      <sec id="sec-4-1">
        <title>4.1. Relation-wise optimal aggregation strategy</title>
        <p>This experiment is designed to find the optimal strategy for each relation. It uses the experiments with
Chain-of-Thought and a number of sampled reasoning paths  = 20. Figure 1 shows the Macro-F1
score on the validation set for every strategy that is used per relation. For one-to-many relations, the
strategies used correspond to the threshold  varying between 0 and 1. For numerical strategies, we
compare majority voting, and mean and median voting.</p>
        <p>According to Figure 1, the best strategy is to use thresholds 0.05 for awardWonBy,
0.3 for companyTradesAtStockExchange and 0.5 for countryLandBordersCountry. For
personHasCityOfDeath, there cannot be multiple answers, so answers are aggregated using majority
voting and considering None as a valid proposition. According to Figure 1, the best strategy is to use
the median for hasArea, and majority voting for hasCapacity. In our submission however, we fixed
hasCapacity to median based on earlier experiments. Median also appears stronger when looking
across consistency levels, not only at level 20. Since both strategies are close in performance, this
diference has little impact on the overall results.</p>
        <p>The strategy analysis without CoT can be found in the Annex (Figure 2).</p>
      </sec>
      <sec id="sec-4-2">
        <title>4.2. Results split per relation</title>
        <p>( = 1). The experiments are compared with macro precision, recall and F1 score. The highest score
per relation is in bold. When self-consistency is used, the optimal strategy is used.</p>
        <p>An analysis of Tables 2 and 3 highlights several notable points:
• Self-consistency ( = 20) convincingly improves performance over predictions with a single
generation ( = 1). This is consistently the case when using Chain-of-Thought reasoning but
also when not using it.
• Chain-of-Thought reasoning yields mixed efects. For some relations like
countryLandBordersCountry and companyTradesAtStockExchange, CoT combined
with self-consistency provides the highest Macro-F1. However, for relations like awardWonBy
and personHasCityOfDeath, CoT does not always improve—and sometimes slightly
Relation
awardWonBy
awardWonBy
awardWonBy
awardWonBy
companyTradesAtStockExchange
companyTradesAtStockExchange
companyTradesAtStockExchange
companyTradesAtStockExchange
countryLandBordersCountry
countryLandBordersCountry
countryLandBordersCountry
countryLandBordersCountry
hasArea
hasArea
hasArea
hasArea
hasCapacity
hasCapacity
hasCapacity
hasCapacity
personHasCityOfDeath
personHasCityOfDeath
personHasCityOfDeath
personHasCityOfDeath
All Relations
All Relations
All Relations
All Relations</p>
        <p>No
No
Yes
Yes
No
No
Yes
Yes
No
No
Yes
Yes
No
No
Yes
Yes
No
No
Yes
Yes
No
No
Yes
Yes
No
No
Yes
Yes
reduces—performance compared to direct generation.
• Numerical relations (hasArea, hasCapacity) achieve modest gains with CoT and
selfconsistency. Their absolute Macro-F1 remains low, highlighting the dificulty of these relations.
• Overall performance is highest when combining CoT with self-consistency ( = 20), reaching a</p>
        <p>Macro-F1 score of 44.4%, beating the baseline (21.2%) by 23.2 points on the test set.</p>
      </sec>
    </sec>
    <sec id="sec-5">
      <title>5. Discussion and Limitations</title>
      <sec id="sec-5-1">
        <title>5.1. What if the model doesn’t know the answer?</title>
        <p>For some training triples, the model consistently fails to produce a correct answer during the construction
of SyntheticCoT. This raises the issue of what the model actually knows, and whether attempting to
retrieve certain knowledge from its parametric memory is meaningful. Since research on interpreting
and extracting knowledge from LLM weights is ongoing, filtering out information the model clearly
doesn’t know could help reduce reliance on guesses. In our approach, we address this by removing
incorrect answers from SyntheticCoT: any triple the model never answers correctly during SyntheticCoT
generation is excluded from the prompts used at inference time. As SyntheticCoT is generated with a
larger model, this strategy assumes that the knowledge of the smaller Qwen3-8B model is contained
within that of the larger Llama3.3 70B model.</p>
      </sec>
      <sec id="sec-5-2">
        <title>5.2.  as a way to control the precision / recall tradeof</title>
        <p>The consistency threshold  is a hyperparameter controlling the precision / recall tradeof. As an
example, triples involving the relation awardWonBy expect many object entities. The challenge is thus
to make many propositions, keeping entities that are rarely predicted, which corresponds to setting a
low threshold. The recall is indeed lower than the precision in our experiments on the validation set
(see table 3). In this case, the optimal threshold was 0.05 (see Figure 1) which means that all entities
that appear in at least 5% of independent answers are kept in the prediction.</p>
        <p>The strategy in HumanCoT for this relation was to predict only a subgroup of tail entities and rely on
self-consistency to aggregate diferent subgroups. This is consistent with fixing a low threshold.
We further analyzed the efect of increasing the number of self-consistency samples (see Annex, Fig. 3)
and observed that gains saturate quickly, typically before  = 20, with the ceiling varying by relation.</p>
      </sec>
      <sec id="sec-5-3">
        <title>5.3. Sampling few-shot examples from SyntheticCoT</title>
        <p>We also tested whether self-consistency remains efective when the few-shot examples are fixed across
generations. In this set-up, as we got rid of the stochasticity in the choice of few-shot examples, the
stochasticity comes exclusively from the generation process. Using a reduced SyntheticCoT of five
random examples per relation, performance averaged 0.398 Macro-F1 on the validation set (across
seven runs), compared to 0.415 when sampling from the full SyntheticCoT dataset. This shows that
self-consistency benefits from diversity in few-shot examples. The generation process still provides
enough stochasticity to induce a large improvement over single-sample inference which gets 19.7.</p>
      </sec>
      <sec id="sec-5-4">
        <title>5.4. Limitations</title>
        <p>There are a few limitations to our work.</p>
        <p>• First, the method still depends on a human annotator to write HumanCoT examples, and the
quality of this dataset is crucial to the overall performance. In our case, it was constructed based
on heuristics. Exploring ways to optimize the construction of HumanCoT and assessing the
scalability of the method—particularly whether HumanCoT can be efectively applied to unseen
relations—would be valuable directions for future work.
• Second, the way we build SyntheticCoT assumes that a CoT is correct if the answer is correct.</p>
        <p>This is not always the case, especially when there is no correct answer. A better filtering would
be beneficial to have a cleaner SyntheticCoT.
• Finally, we still have not found the ideal solution for relations such as awardWonBy and
hasCapacity. Whether the model weights actually contain the required answers is an open
question.</p>
      </sec>
    </sec>
    <sec id="sec-6">
      <title>6. Conclusion</title>
      <p>In this paper, we propose a Relation-Wise Self-consistency, a method that uses a very limited amount of
human-written Chains-of-Thought to build SyntheticCoT, a larger and synthetic set of reasoning paths
that lead to a correct answer. This dataset is used in a few-shot prompting fashion to give a systematic
way to reason about given relations in order to complete triples. We then adapt self-consistency to
a general setting without making assumptions about the cardinality of the relations, instead tuning
the prediction aggregation process on a per-relation basis. Our results show that our method ReWiSe
improves performance on the LM-KBC 2025 challenge, achieving a Macro-F1 score of 44.4% on the test
set, a gain of 23.2 points over the baseline.</p>
      <p>Overall, this year’s edition of the LM-KBC challenge did not allow fine-tuning. However, training the
LLM to learn relations instead of few-shot prompting could be an interesting future work. It could take
advantage of our SyntheticCoT dataset. The intuition for self-consistency is that multiple reasoning
paths can lead to the correct answer. Training on SyntheticCoT would take advantage of the multiplicity
of reasoning paths that lead to the correct answer for a given relation.</p>
    </sec>
    <sec id="sec-7">
      <title>Declaration on Generative AI</title>
      <p>During the preparation of this work, the author(s) used ChatGPT in order to: Grammar and spelling
check, Paraphrase and reword. After using these tool(s)/service(s), the author(s) reviewed and edited
the content as needed and take(s) full responsibility for the publication’s content.</p>
    </sec>
    <sec id="sec-8">
      <title>Acknowledgments</title>
      <p>We thank NSERC for supporting this work through a Discovery Grant and Compute Canada and Calcul
Québec for the computational resources provided.
[18] Z. Jiang, F. F. Xu, J. Araki, G. Neubig, How Can We Know What Language Models Know?, 2020.</p>
      <p>URL: http://arxiv.org/abs/1911.12543. doi:10.48550/arXiv.1911.12543, arXiv:1911.12543 [cs].
[19] S. Kadavath, T. Conerly, A. Askell, T. Henighan, D. Drain, E. Perez, N. Schiefer, Z.
HatfieldDodds, N. DasSarma, E. Tran-Johnson, S. Johnston, S. El-Showk, A. Jones, N. Elhage, T. Hume,
A. Chen, Y. Bai, S. Bowman, S. Fort, D. Ganguli, D. Hernandez, J. Jacobson, J. Kernion, S. Kravec,
L. Lovitt, K. Ndousse, C. Olsson, S. Ringer, D. Amodei, T. Brown, J. Clark, N. Joseph, B. Mann,
S. McCandlish, C. Olah, J. Kaplan, Language Models (Mostly) Know What They Know, 2022. URL:
http://arxiv.org/abs/2207.05221. doi:10.48550/arXiv.2207.05221, arXiv:2207.05221 [cs].
[20] X. Wang, J. Wei, D. Schuurmans, Q. Le, E. Chi, S. Narang, A. Chowdhery, D. Zhou, Self-Consistency
Improves Chain of Thought Reasoning in Language Models, 2023. URL: http://arxiv.org/abs/2203.
11171. doi:10.48550/arXiv.2203.11171, arXiv:2203.11171 [cs].
[21] X. Chen, R. Aksitov, U. Alon, J. Ren, K. Xiao, P. Yin, S. Prakash, C. Sutton, X. Wang, D. Zhou,
Universal Self-Consistency for Large Language Model Generation, 2023. URL: http://arxiv.org/abs/
2311.17311. doi:10.48550/arXiv.2311.17311, arXiv:2311.17311 [cs].
[22] Y. Wang, S. Zhao, Z. Wang, H. Huang, M. Fan, Y. Zhang, Z. Wang, H. Wang, T. Liu, Strategic
Chain-of-Thought: Guiding Accurate Reasoning in LLMs through Strategy Elicitation, 2024. URL:
http://arxiv.org/abs/2409.03271. doi:10.48550/arXiv.2409.03271, arXiv:2409.03271 [cs].</p>
    </sec>
    <sec id="sec-9">
      <title>7. Annex</title>
      <sec id="sec-9-1">
        <title>7.1. Relation-wise optimal aggregation strategy</title>
        <p>Raising  is an ensembling method and it diminishes the risk as the prediction uses a larger set
of independent answers. The prediction is relying less on a single answer as  grows. For both
experiments with and without CoT, and using  = 20, the scores of the diferent aggregation strategies
for each relation on the validation set are shown in Figures 1 and 2, respectively.</p>
        <p>Figure 3 illustrates the impact of raising the number of self-consistency samples  from 1 up to 100
for each relation on the validation set Macro-F1 score. The curves show that performance generally
increases quickly at small  (up to around 20), after which it plateaus. Most of the benefit of
relationwise self-consistency can thus be achieved with relatively few samples, although the exact point of
saturation depends on the relation. For example, personHasCityOfDeath continues to improve
until around  = 20, while companyTradesAtStockExchange reaches its plateau earlier (around
 = 10). In contrast, hasArea remains low overall, indicating limited benefit from self-consistency
for this relation.</p>
      </sec>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>
          [1]
          <string-name>
            <given-names>X.</given-names>
            <surname>Liu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Y.</given-names>
            <surname>Zheng</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Z.</given-names>
            <surname>Du</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Ding</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Y.</given-names>
            <surname>Qian</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Z.</given-names>
            <surname>Yang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Tang</surname>
          </string-name>
          , GPT Understands, Too,
          <year>2023</year>
          . URL: http://arxiv.org/abs/2103.10385. doi:
          <volume>10</volume>
          .48550/arXiv.2103.10385, arXiv:
          <fpage>2103</fpage>
          .10385 [cs].
        </mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>
          [2]
          <string-name>
            <given-names>M.</given-names>
            <surname>Zhang</surname>
          </string-name>
          , O. Press,
          <string-name>
            <given-names>W.</given-names>
            <surname>Merrill</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Liu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>N. A.</given-names>
            <surname>Smith</surname>
          </string-name>
          ,
          <string-name>
            <surname>How Language Model Hallucinations Can Snowball</surname>
          </string-name>
          ,
          <year>2023</year>
          . URL: http://arxiv.org/abs/2305.13534. doi:
          <volume>10</volume>
          .48550/arXiv.2305.13534, arXiv:
          <fpage>2305</fpage>
          .13534 [cs].
        </mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>
          [3]
          <string-name>
            <given-names>R.</given-names>
            <surname>Zhao</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Köksal</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Modarressi</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M. A.</given-names>
            <surname>Hedderich</surname>
          </string-name>
          ,
          <string-name>
            <given-names>H.</given-names>
            <surname>Schütze</surname>
          </string-name>
          ,
          <string-name>
            <surname>Do We Know What LLMs Don't Know</surname>
          </string-name>
          ?
          <article-title>A Study of Consistency in Knowledge Probing</article-title>
          ,
          <year>2025</year>
          . URL: http://arxiv.org/abs/2505.21701. doi:
          <volume>10</volume>
          .48550/arXiv.2505.21701, arXiv:
          <fpage>2505</fpage>
          .21701 [cs].
        </mixed-citation>
      </ref>
      <ref id="ref4">
        <mixed-citation>
          [4]
          <string-name>
            <given-names>C.</given-names>
            <surname>Jiang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>B.</given-names>
            <surname>Qi</surname>
          </string-name>
          ,
          <string-name>
            <given-names>X.</given-names>
            <surname>Hong</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D.</given-names>
            <surname>Fu</surname>
          </string-name>
          , Y. Cheng,
          <string-name>
            <given-names>F.</given-names>
            <surname>Meng</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Yu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>B.</given-names>
            <surname>Zhou</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Zhou</surname>
          </string-name>
          , On Large Language Models' Hallucination with Regard to Known Facts,
          <year>2024</year>
          . URL: http://arxiv.org/abs/2403.20009. doi:
          <volume>10</volume>
          .48550/arXiv.2403.20009, arXiv:
          <fpage>2403</fpage>
          .20009 [cs].
        </mixed-citation>
      </ref>
      <ref id="ref5">
        <mixed-citation>
          [5]
          <string-name>
            <given-names>A.</given-names>
            <surname>Mallen</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Asai</surname>
          </string-name>
          ,
          <string-name>
            <given-names>V.</given-names>
            <surname>Zhong</surname>
          </string-name>
          ,
          <string-name>
            <surname>R. Das</surname>
            ,
            <given-names>D.</given-names>
          </string-name>
          <string-name>
            <surname>Khashabi</surname>
            ,
            <given-names>H.</given-names>
          </string-name>
          <string-name>
            <surname>Hajishirzi</surname>
          </string-name>
          , When Not to Trust Language Models: Investigating Efectiveness of Parametric and
          <string-name>
            <surname>Non-Parametric Memories</surname>
          </string-name>
          ,
          <year>2023</year>
          . URL: http://arxiv.org/abs/2212.10511. doi:
          <volume>10</volume>
          .48550/arXiv.2212.10511, arXiv:
          <fpage>2212</fpage>
          .10511 [cs].
        </mixed-citation>
      </ref>
      <ref id="ref6">
        <mixed-citation>
          [6]
          <string-name>
            <given-names>J.-C.</given-names>
            <surname>Kalo</surname>
          </string-name>
          ,
          <string-name>
            <given-names>T.-P.</given-names>
            <surname>Nguyen</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Razniewski</surname>
          </string-name>
          ,
          <string-name>
            <surname>B. Zhang,</surname>
          </string-name>
          <article-title>Lm-kbc challenge @ iswc 2025, in: 4th Semantic Web Challenge on Language Models for Knowledge Base Construction Challenge</article-title>
          ,
          <year>2025</year>
          . URL: https://lm-kbc.github.io/challenge2025/.
        </mixed-citation>
      </ref>
      <ref id="ref7">
        <mixed-citation>
          [7]
          <string-name>
            <given-names>D.</given-names>
            <surname>Vrandečić</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Krötzsch</surname>
          </string-name>
          ,
          <article-title>Wikidata: a free collaborative knowledgebase</article-title>
          ,
          <source>Communications of the ACM</source>
          <volume>57</volume>
          (
          <year>2014</year>
          )
          <fpage>78</fpage>
          -
          <lpage>85</lpage>
          . URL: https://dl.acm.org/doi/10.1145/2629489. doi:
          <volume>10</volume>
          .1145/2629489.
        </mixed-citation>
      </ref>
      <ref id="ref8">
        <mixed-citation>
          [8]
          <string-name>
            <given-names>B.</given-names>
            <surname>Zhang</surname>
          </string-name>
          , I. Reklos,
          <string-name>
            <given-names>N.</given-names>
            <surname>Jain</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A. M.</given-names>
            <surname>Peñuela</surname>
          </string-name>
          , E. Simperl,
          <article-title>Using Large Language Models for Knowledge Engineering (LLMKE):</article-title>
          <source>A Case Study on Wikidata</source>
          ,
          <year>2023</year>
          . URL: https://arxiv.org/abs/2309.08491. doi:
          <volume>10</volume>
          .48550/ARXIV.2309.08491, version Number:
          <volume>1</volume>
          .
        </mixed-citation>
      </ref>
      <ref id="ref9">
        <mixed-citation>
          [9]
          <string-name>
            <given-names>L.</given-names>
            <surname>Wang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>W.</given-names>
            <surname>Zhao</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Z.</given-names>
            <surname>Wei</surname>
          </string-name>
          , J. Liu,
          <article-title>SimKGC: Simple Contrastive Knowledge Graph Completion with Pre-trained Language</article-title>
          <string-name>
            <surname>Models</surname>
          </string-name>
          ,
          <year>2022</year>
          . URL: http://arxiv.org/abs/2203.02167, arXiv:
          <fpage>2203</fpage>
          .02167 [cs].
        </mixed-citation>
      </ref>
      <ref id="ref10">
        <mixed-citation>
          [10]
          <string-name>
            <given-names>S.</given-names>
            <surname>Pan</surname>
          </string-name>
          ,
          <string-name>
            <given-names>L.</given-names>
            <surname>Luo</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Y.</given-names>
            <surname>Wang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>C.</given-names>
            <surname>Chen</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Wang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>X.</given-names>
            <surname>Wu</surname>
          </string-name>
          ,
          <article-title>Unifying Large Language Models and Knowledge Graphs: A Roadmap, IEEE Transactions on Knowledge and Data Engineering (</article-title>
          <year>2024</year>
          )
          <fpage>1</fpage>
          -
          <lpage>20</lpage>
          . URL: https://ieeexplore.ieee.org/document/10387715/. doi:
          <volume>10</volume>
          .1109/TKDE.
          <year>2024</year>
          .
          <volume>3352100</volume>
          .
        </mixed-citation>
      </ref>
      <ref id="ref11">
        <mixed-citation>
          [11]
          <string-name>
            <given-names>F.</given-names>
            <surname>Petroni</surname>
          </string-name>
          ,
          <string-name>
            <given-names>T.</given-names>
            <surname>Rocktäschel</surname>
          </string-name>
          ,
          <string-name>
            <given-names>P.</given-names>
            <surname>Lewis</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Bakhtin</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Y.</given-names>
            <surname>Wu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A. H.</given-names>
            <surname>Miller</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Riedel</surname>
          </string-name>
          ,
          <article-title>Language Models as Knowledge Bases? (</article-title>
          <year>2019</year>
          ). URL: https://arxiv.org/abs/
          <year>1909</year>
          .01066. doi:
          <volume>10</volume>
          .48550/ARXIV.
          <year>1909</year>
          .
          <volume>01066</volume>
          , publisher: [object Object]
          <source>Version Number: 2.</source>
        </mixed-citation>
      </ref>
      <ref id="ref12">
        <mixed-citation>
          [12]
          <string-name>
            <given-names>T.</given-names>
            <surname>Shin</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Y.</given-names>
            <surname>Razeghi</surname>
          </string-name>
          ,
          <string-name>
            <given-names>R. L. L.</given-names>
            <surname>IV</surname>
          </string-name>
          , E. Wallace, S. Singh,
          <article-title>AutoPrompt: Eliciting Knowledge from Language Models with Automatically Generated Prompts</article-title>
          ,
          <year>2020</year>
          . URL: http://arxiv.org/abs/
          <year>2010</year>
          . 15980. doi:
          <volume>10</volume>
          .48550/arXiv.
          <year>2010</year>
          .
          <volume>15980</volume>
          , arXiv:
          <year>2010</year>
          .15980 [cs].
        </mixed-citation>
      </ref>
      <ref id="ref13">
        <mixed-citation>
          [13]
          <string-name>
            <given-names>X. L.</given-names>
            <surname>Li</surname>
          </string-name>
          ,
          <string-name>
            <given-names>P.</given-names>
            <surname>Liang</surname>
          </string-name>
          , Prefix-Tuning:
          <article-title>Optimizing Continuous Prompts for Generation, 2021</article-title>
          . URL: http://arxiv.org/abs/2101.00190. doi:
          <volume>10</volume>
          .48550/arXiv.2101.00190, arXiv:
          <fpage>2101</fpage>
          .00190 [cs].
        </mixed-citation>
      </ref>
      <ref id="ref14">
        <mixed-citation>
          [14]
          <string-name>
            <given-names>S.</given-names>
            <surname>Pitis</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M. R.</given-names>
            <surname>Zhang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Wang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Ba</surname>
          </string-name>
          ,
          <source>Boosted Prompt Ensembles for Large Language Models</source>
          ,
          <year>2023</year>
          . URL: http://arxiv.org/abs/2304.05970. doi:
          <volume>10</volume>
          .48550/arXiv.2304.05970, arXiv:
          <fpage>2304</fpage>
          .05970 [cs].
        </mixed-citation>
      </ref>
      <ref id="ref15">
        <mixed-citation>
          [15]
          <string-name>
            <given-names>J.</given-names>
            <surname>Wei</surname>
          </string-name>
          ,
          <string-name>
            <given-names>X.</given-names>
            <surname>Wang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D.</given-names>
            <surname>Schuurmans</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Bosma</surname>
          </string-name>
          ,
          <string-name>
            <given-names>B.</given-names>
            <surname>Ichter</surname>
          </string-name>
          ,
          <string-name>
            <given-names>F.</given-names>
            <surname>Xia</surname>
          </string-name>
          ,
          <string-name>
            <given-names>E.</given-names>
            <surname>Chi</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Q.</given-names>
            <surname>Le</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D.</given-names>
            <surname>Zhou</surname>
          </string-name>
          , Chain-ofThought
          <source>Prompting Elicits Reasoning in Large Language Models</source>
          ,
          <year>2023</year>
          . URL: http://arxiv.org/abs/ 2201.11903, arXiv:
          <fpage>2201</fpage>
          .11903 [cs].
        </mixed-citation>
      </ref>
      <ref id="ref16">
        <mixed-citation>
          [16]
          <string-name>
            <given-names>A.</given-names>
            <surname>Yang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Li</surname>
          </string-name>
          ,
          <string-name>
            <given-names>B.</given-names>
            <surname>Yang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>B.</given-names>
            <surname>Zhang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>B.</given-names>
            <surname>Hui</surname>
          </string-name>
          ,
          <string-name>
            <given-names>B.</given-names>
            <surname>Zheng</surname>
          </string-name>
          ,
          <string-name>
            <given-names>B.</given-names>
            <surname>Yu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>C.</given-names>
            <surname>Gao</surname>
          </string-name>
          ,
          <string-name>
            <given-names>C.</given-names>
            <surname>Huang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>C.</given-names>
            <surname>Lv</surname>
          </string-name>
          ,
          <string-name>
            <given-names>C.</given-names>
            <surname>Zheng</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D.</given-names>
            <surname>Liu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>F.</given-names>
            <surname>Zhou</surname>
          </string-name>
          ,
          <string-name>
            <given-names>F.</given-names>
            <surname>Huang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>F.</given-names>
            <surname>Hu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>H.</given-names>
            <surname>Ge</surname>
          </string-name>
          ,
          <string-name>
            <given-names>H.</given-names>
            <surname>Wei</surname>
          </string-name>
          ,
          <string-name>
            <given-names>H.</given-names>
            <surname>Lin</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Tang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Yang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Tu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Zhang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Yang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Yang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Zhou</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Zhou</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Lin</surname>
          </string-name>
          ,
          <string-name>
            <given-names>K.</given-names>
            <surname>Dang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>K.</given-names>
            <surname>Bao</surname>
          </string-name>
          ,
          <string-name>
            <given-names>K.</given-names>
            <surname>Yang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>L.</given-names>
            <surname>Yu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>L.</given-names>
            <surname>Deng</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Li</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Xue</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Li</surname>
          </string-name>
          ,
          <string-name>
            <given-names>P.</given-names>
            <surname>Zhang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>P.</given-names>
            <surname>Wang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Q.</given-names>
            <surname>Zhu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>R.</given-names>
            <surname>Men</surname>
          </string-name>
          ,
          <string-name>
            <given-names>R.</given-names>
            <surname>Gao</surname>
          </string-name>
          , S. Liu,
          <string-name>
            <given-names>S.</given-names>
            <surname>Luo</surname>
          </string-name>
          ,
          <string-name>
            <given-names>T.</given-names>
            <surname>Li</surname>
          </string-name>
          ,
          <string-name>
            <given-names>T.</given-names>
            <surname>Tang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>W.</given-names>
            <surname>Yin</surname>
          </string-name>
          ,
          <string-name>
            <given-names>X.</given-names>
            <surname>Ren</surname>
          </string-name>
          ,
          <string-name>
            <given-names>X.</given-names>
            <surname>Wang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>X.</given-names>
            <surname>Zhang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>X.</given-names>
            <surname>Ren</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Y.</given-names>
            <surname>Fan</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Y.</given-names>
            <surname>Su</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Y.</given-names>
            <surname>Zhang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Y.</given-names>
            <surname>Zhang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Y.</given-names>
            <surname>Wan</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Y.</given-names>
            <surname>Liu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Z.</given-names>
            <surname>Wang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Z.</given-names>
            <surname>Cui</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Z.</given-names>
            <surname>Zhang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Z.</given-names>
            <surname>Zhou</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Z.</given-names>
            <surname>Qiu</surname>
          </string-name>
          ,
          <source>Qwen3 Technical Report</source>
          ,
          <year>2025</year>
          . URL: http://arxiv.org/abs/2505.09388. doi:
          <volume>10</volume>
          . 48550/arXiv.2505.09388, arXiv:
          <fpage>2505</fpage>
          .09388 [cs].
        </mixed-citation>
      </ref>
      <ref id="ref17">
        <mixed-citation>
          [17]
          <string-name>
            <given-names>D.</given-names>
            <surname>Su</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Patwary</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Prabhumoye</surname>
          </string-name>
          ,
          <string-name>
            <given-names>P.</given-names>
            <surname>Xu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>R.</given-names>
            <surname>Prenger</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Shoeybi</surname>
          </string-name>
          ,
          <string-name>
            <given-names>P.</given-names>
            <surname>Fung</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Anandkumar</surname>
          </string-name>
          ,
          <string-name>
            <given-names>B.</given-names>
            <surname>Catanzaro</surname>
          </string-name>
          , Context Generation Improves Open Domain Question Answering,
          <year>2023</year>
          . URL: http: //arxiv.org/abs/2210.06349. doi:
          <volume>10</volume>
          .48550/arXiv.2210.06349, arXiv:
          <fpage>2210</fpage>
          .06349 [cs].
        </mixed-citation>
      </ref>
      <ref id="ref18">
        <mixed-citation>
          7.2.
          <string-name>
            <surname>SyntheticCoT</surname>
          </string-name>
          post-processing
        </mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>