An Overview of Using Large Language Models for the Symbol Grounding Task in ABC Repair System

Pak Yin Chan, Xue Li and Alan Bundy
School of Informatics, The University of Edinburgh, United Kingdom
s2341572@ed.ac.uk (P. Y. Chan); xue.shirley.li@ed.ac.uk (X. Li); A.Bundy@ed.ac.uk (A. Bundy)
Cognitive AI 2023, 13th-15th November 2023, Bari, Italy.

Abstract
The ABC Theory Repair System (ABC) has demonstrated its success in helping users repair faulty theories using distinct techniques. Yet, comprehending ABC-repaired theories becomes more challenging due to the presence of dummy constants or predicates introduced by ABC. In this paper, we propose a grounding system that incorporates Large Language Models (LLMs) to give these dummy items meaningful names. By applying ABC and grounding alternately, the resulting theory is both fault-free and semantically meaningful. Moreover, our study shows that LLMs without fine-tuning still exhibit common-knowledge capabilities, and their grounding performance is enhanced by providing sufficient background or asking for more returned answers.

Keywords
Large language model, Closed-book question answering, Faulty logical theory repair, Automated theorem proving

1. Introduction

Logical theory stands as a reasoning tool in the field of artificial intelligence (AI), providing structured and precise representations of relationships among objects [1]. A theory needs to be refined to cope with new observations [2]. When users introduce novel information, the original theory may either make incorrect predictions or fail to predict the expected truth [3]. To address faults in flawed logical theories, the ABC Theory Repair System (ABC) was proposed; it automatically integrates several repair techniques, such as abduction [4], belief revision [5], and conceptual change with reformation [6], based on user observations [2, 3].

Although ABC is capable of generating error-free theories, users might be confused by the appearance of dummy constants or predicates (hereafter “dummy items”) in these theories. Example 1 illustrates a repaired theory featuring dummy items. In the original theory, we can prove that Camilla is equal to Diana, which is incompatible with the fact that they are not the same person. To repair this theory, ABC introduces two dummy constants, “dummyConst1” and “dummyConst2”, to differentiate between two types of mothers. These dummy items come into existence when ABC employs repair plans involving the reformation technique. They are assigned names prefixed with “dummy” because their meanings are unknown to ABC [3]. Presently, users are responsible for manually assigning names to these dummy items. In the previous example, users can deduce that “dummyConst1” and “dummyConst2” refer to a birth mother and a stepmother respectively. Yet, there are instances where users are uncertain how to name these items. Since the current implementation of ABC does not consider their semantic implications, many generated repaired theories may be logically consistent but devoid of semantic meaning.
Example 1: A Comparison of Original and Repaired Motherhood Theory

Original Theory:
mum(X, Z) ∧ mum(Y, Z) ⟹ X = Y
⟹ mum(camilla, william)
⟹ mum(diana, william)

Repaired Theory:
mum(X, Z, dummyConst1) ∧ mum(Y, Z, dummyConst1) ⟹ X = Y
⟹ mum(camilla, william, dummyConst2)
⟹ mum(diana, william, dummyConst1)

The challenge of attributing meanings to meaningless symbols is known as the “symbol grounding problem”, an important problem in Cognitive Science [7]. To tackle this problem automatically, we require tools with access to common-sense knowledge, enabling them to suggest potential names for users to consider. Although Large Language Models (LLMs) exhibit inconsistencies in reasoning [8, 9], studies indicate that they store extensive relational knowledge from their training datasets during pretraining [10, 11, 12]. This suggests that we can leverage LLMs to propose possible meanings for dummy items by presenting propositions involving these items in natural language. Our study demonstrates an application of LLMs to the symbol grounding problem.

The primary objective of this paper is to enrich the semantic content of the repaired theory by utilizing LLMs to replace the names of dummy items with semantically meaningful content. We hypothesize that the closed-book question answering (CBQA) task [12] with LLMs is suited to the symbol grounding challenge within ABC; in the CBQA task, LLMs generate responses based solely on their training data, without access to external sources [11, 12]. To explore how well LLMs can provide meanings for dummy items in theories repaired by ABC, and thereby enhance the semantics of the repaired theory, we propose an interactive symbol grounding system for ABC that determines the meanings of dummy items.

2. LLM-grounding system

Figure 1 illustrates the process employed by the grounding system. We first parse the input theory in Datalog, a subset of First Order Logic (FOL) [13], and the system sets up records of constants and predicates (G1). These records facilitate the detection of ungrounded dummy items. For each dummy item, the system interprets the associated axioms into a natural language question (G2). After grounding with the LLM that users choose (G3), the system presents all choices returned by that LLM. Users can select the LLM-suggested answers or suggest new answers themselves (G4a). After that, the system also recommends existing theory items with high similarity to any previously suggested answers (G4b). Once all dummy items are successfully grounded, the system replaces these items with the selected answers (G5) and exports all possible grounded theories in Datalog [13].

Figure 1: Flow chart of the grounding system. Modules involving the user’s inputs are coloured yellow.

Before grounding starts, ABC detects and repairs the fault in the input theory 𝕋 when it conflicts with the given preferred structure ℙ𝕊, which represents users’ observations [3]. Once the repair is done, users need to manually copy a repaired theory to start the grounding process.

We design a heuristic to formulate a prompt from the ungrounded theory. We suspect that LLMs perform grounding more accurately if we provide sufficient knowledge in the prompt, so we include assertions without dummy items and combine multiple axioms sharing the same dummy item in the same prompt, as sketched below.
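To make this grouping heuristic concrete, the following minimal sketch shows how background assertions and dummy-related axioms might be collected into a single prompt. It is an illustration under simplifying assumptions rather than the system's actual implementation: the Axiom representation, the helper names, and the naive arity-2 reading of propositions (the full interpretation is given in Table 2 in the Appendix) are ours.

```python
from dataclasses import dataclass
from typing import List, Tuple

# A proposition is (predicate, arguments); an axiom is premises => conclusion.
# This representation is illustrative, not ABC's internal one.
Prop = Tuple[str, Tuple[str, ...]]

@dataclass
class Axiom:
    premises: List[Prop]
    conclusion: Prop

def interpret(prop: Prop) -> str:
    """Naive English reading of a proposition (simplified from Table 2)."""
    pred, args = prop
    if len(args) == 1:
        return f"{args[0]} is {pred}"
    return f"{args[0]} is {pred} of {args[1]}"

def has_dummy(prop: Prop) -> bool:
    pred, args = prop
    return "dummy" in pred or any("dummy" in a for a in args)

def mentions(axiom: Axiom, item: str) -> bool:
    return any(item == pred or item in args
               for pred, args in axiom.premises + [axiom.conclusion])

def build_prompt(theory: List[Axiom], dummy: str,
                 dummy_type: str = "entity", word_limit: int = 5) -> str:
    # Background: assertions (axioms without premises) containing no dummy item.
    background = [interpret(a.conclusion) for a in theory
                  if not a.premises and not has_dummy(a.conclusion)]
    # Every axiom mentioning the dummy item is phrased as a sentence, with the
    # dummy name replaced by its type ("entity"/"kind"/"property").
    related = []
    for a in theory:
        if not mentions(a, dummy):
            continue
        head = interpret(a.conclusion)
        if a.premises:
            cond = " and ".join(interpret(p) for p in a.premises)
            head = f"In a FOL expression, if {cond}, then {head}"
        related.append(head.replace(dummy, "the " + dummy_type))
    prompt = ("Given that " + ", and ".join(background) + ". ") if background else ""
    prompt += (f"What is a possible {dummy_type}, such that "
               + ", and ".join(related) + f"? Answer within {word_limit} words.")
    return prompt

# A fragment of the repaired Tweety theory used in Example 2:
tweety = [
    Axiom([], ("penguin", ("tweety",))),
    Axiom([], ("bird", ("polly", "dummyConst1"))),
    Axiom([("bird", ("X", "dummyConst1"))], ("fly", ("X",))),
]
print(build_prompt(tweety, "dummyConst1"))
```

Running this fragment produces a question of the same shape as the prompt shown in Example 2, differing only in the articles and verb forms that the full Table 2 interpretation handles.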
Each proposition in the axioms, up to arity 3, is converted into natural language using the interpretation in Table 2 in the Appendix. The interpretation is similar to [14], except that we substitute the dummy name with the item’s type: “property” for a dummy predicate, and “entity”/“kind” for a dummy constant. We gather the propositions into rules using conditional sentences and append a word-limit specification at the end of the prompt to prevent LLMs from returning lengthy answers. Example 2 illustrates the resulting prompt for grounding dummyConst1 using the above heuristics, with the word limit set to 5.

For each grounding of a dummy item, users are presented with one or two rounds of recommendations. The initial recommendations come from the “LLM-grounding Recommender”, a phase in which the LLM’s suggestions are displayed. Users are given the option to directly select the LLM-suggested answers, retain the dummy name, or propose new names. The latter choice allows users to refine the grounding names based on the LLM-generated answers or to tailor them to their preferences. In the subsequent phase, the “Existed-items Recommender”, the system searches for existing items within the theory that exhibit high similarity to the chosen or suggested groundings from the previous phase. This comparison assesses the resemblance of all prior suggestions to constants or predicates within the theory, depending on the type of dummy item. This phase employs the F1 score of vanilla BERTScore (referred to as F1 BERTScore) [15] to gauge the similarity between items. BERTScore utilizes embeddings from the pre-trained BERT model, calculating the cosine similarity of embeddings to measure word matches between candidates and references [15]. Higher scores correspond to greater similarity. Users also have the flexibility to set a threshold for the F1 BERTScore, so that the system recommends only items whose scores surpass the threshold.

Example 2: Prompt Formulation in the Repaired Tweety Theory

⟹ bird(polly, dummyConst1)
bird(X, Y) ⟹ feather(X)
bird(X, dummyConst1) ⟹ fly(X)
⟹ penguin(tweety)
penguin(X) ⟹ bird(X, flightless)

Prompt: Given that tweety is a penguin. What is a possible entity, such that polly is a bird of the entity, and In a FOL expression, if X is a bird of the entity, then X can fly? Answer within 5 words.
(Note that the names of the penguins Tweety and Polly are not capitalized in the prompt.)

An important consideration is that the grounding process has the potential to reintroduce faults into repaired theories. In response, we incorporate a safeguard as an extra feature that helps users re-run ABC after all grounded theories have been exported. This validation step assesses whether the theories remain free from faults. This iterative approach of repair and grounding persists until users attain a satisfactory theory. A practical example of the interplay between repair and grounding is given in the appendix.
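Since the grounding step (G3) amounts to a single zero-shot chat completion over the formulated prompt, it can be sketched in a few lines. The sketch below assumes the pre-1.0 openai Python client that was current in 2023 and uses illustrative parameter values; it is not the exact code or settings of our system.

```python
import openai  # assumes the pre-1.0 client, e.g. `pip install openai==0.28`

openai.api_key = "sk-..."  # placeholder API key

# The formulated prompt of Example 2.
prompt = (
    "Given that tweety is a penguin. What is a possible entity, such that "
    "polly is a bird of the entity, and In a FOL expression, if X is a bird "
    "of the entity, then X can fly? Answer within 5 words."
)

response = openai.ChatCompletion.create(
    model="gpt-3.5-turbo",  # or "gpt-4", depending on the user's choice
    messages=[{"role": "user", "content": prompt}],
    n=3,  # several candidate groundings can be requested at once (cf. Table 1)
)

# Candidate names, to be shown to the user in step G4a.
candidates = [choice["message"]["content"].strip() for choice in response.choices]
print(candidates)
```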
3. Grounding Performance

We experimented with GPT-3.5 Turbo (4K context version) and GPT-4 (8K context version) [10, 16] using OpenAI's ChatCompletion API without further fine-tuning. As ABC is a domain-independent repair system, we intentionally omitted both fine-tuning and few-shot learning to gauge how these models perform without such adjustments. As ABC uses Datalog, a subset of FOL, the grounding system cannot be evaluated with major FOL datasets [17, 18]. As a compromise, we examined the performance of the LLMs as follows: we substituted some items in theories with dummy names and studied whether the LLMs, within our system, could ground them with items similar to the original ones. We constructed theories automatically from two knowledge bases, the enriched WebNLG dataset [19] and an excerpt of DART [20], and replaced some items with dummy names. These theories serve as simulations of generated repaired theories, containing both assertions and rules. Details of the construction of the evaluation dataset are in the project's GitHub repository (https://github.com/HistoChan/ABCGrounding).

We adopted two semantics-based metrics, F1 BERTScore [15] and SAS [21], to evaluate the semantic similarity of answers in the LLM-grounding Recommender, as they have been shown to correlate to a certain extent with human judgement [21]. The former is the same metric used in the Existed-items Recommender. SAS, in contrast, uses a pre-trained cross-encoder, applying the model to the two texts concatenated with a separator token in between; unlike BERTScore, SAS considers the two inputs jointly [21]. We calculated the scores of the LLM outputs against the original items' names. All metric values range from 0 to 1, with values closer to 1 indicating greater semantic similarity between candidates and references.

                                       gpt-3.5-turbo         gpt-4
Prompt content    Number of outputs    BERTScore   SAS       BERTScore   SAS
basic             1                    81.78       8.63      81.73       11.35
w/ background     1                    82.46       12.82     84.48       26.82
w/ multi axioms   1                    81.89       9.54      82.27       15.41
w/ both           1                    82.54       13.22     84.99       30.29
w/ both           3                    84.51       20.31     86.47       37.06

Table 1: Performance of GPT models in the LLM-grounding Recommender, where the metrics are micro-average percentage points.

Table 1 reports the statistics of the experiment, which show that an LLM without any fine-tuning can still achieve adequate zero-shot grounding performance. The increase in the metrics confirms that including additional background information in the prompt yields higher-quality grounding outcomes. Background information emerges as a more impactful hint for successful grounding than querying multiple axioms in a single prompt. Moreover, performance generally improves with model size, and increasing the number of returned groundings correspondingly enhances overall performance. We also evaluated other models, such as the T5 models by Google [22], the OpenLLaMA models from OpenLM Research [23], and the Dolly 2.0 models by Databricks [24]. Their performance also supports the above statements; the statistics are available in the project's GitHub repository. Some case studies are in the appendix.
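As a pointer for reproducing the scoring, the sketch below shows one way the two similarity measures could be computed for candidate/reference name pairs: the bert-score package for F1 BERTScore, and a publicly available sentence-transformers cross-encoder standing in for the SAS model of [21]. The candidate/reference strings and the cross-encoder checkpoint are illustrative assumptions, not the exact setup of our experiments.

```python
from bert_score import score                     # pip install bert-score
from sentence_transformers import CrossEncoder   # pip install sentence-transformers

# Illustrative pairs: candidate groundings returned by an LLM and the
# original (reference) item names they replace.
candidates = ["birth mother", "step mother"]
references = ["biological mother", "stepmother"]

# F1 BERTScore (vanilla): greedy cosine matching of BERT token embeddings.
_, _, f1 = score(candidates, references, lang="en")
print("F1 BERTScore:", [round(v, 4) for v in f1.tolist()])

# SAS-style score: a cross-encoder reads candidate and reference jointly.
# The checkpoint below is an assumed stand-in trained on STS data; its
# outputs lie roughly in [0, 1], higher meaning more similar.
sas_model = CrossEncoder("cross-encoder/stsb-roberta-large")
sas = sas_model.predict(list(zip(candidates, references)))
print("SAS (stand-in):", [round(float(v), 4) for v in sas])
```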
4. Conclusions

In this paper, we have proposed a symbol grounding system for the ABC repair system. We formulate the grounding challenge as a CBQA task and ask LLMs to return possible answers. The system also embraces user interactivity: users control the choice of model, the format of prompts, and the grounding results. This system not only helps to enhance the semantics of the repaired theory but also helps determine the rationality of the repair plan. Yet, we suspect that the grounding performance is limited by the quality of the prompt, which can be improved in the future. Additionally, we show that LLMs can be used without either fine-tuning or few-shot learning for CBQA tasks by providing sufficient background information and increasing the number of outputs. Nonetheless, we do not claim that providing background information can replace few-shot learning. It is worth studying whether few-shot learning can achieve performance similar to fine-tuning, and whether the performance gain from few-shot learning is limited by the scale of the model.

References

[1] A. Barr, E. A. Feigenbaum, The handbook of artificial intelligence, volume 1, Butterworth-Heinemann, 1981. URL: https://www.sciencedirect.com/science/article/pii/B9780865760899500089. doi:10.1016/B978-0-86576-089-9.50008-9.
[2] A. Bundy, X. Li, Representational change is integral to reasoning, Philosophical Transactions of the Royal Society A: Mathematical, Physical and Engineering Sciences 381 (2023) 20220052. URL: https://royalsocietypublishing.org/doi/abs/10.1098/rsta.2022.0052. doi:10.1098/rsta.2022.0052.
[3] X. Li, Automating the Repair of Faulty Logical Theories, 2021.
[4] C. Sakama, K. Inoue, An abductive framework for computing knowledge base updates, Theory and Practice of Logic Programming 3 (2003). doi:10.1017/S1471068403001716.
[5] S. O. Hansson, Ten philosophical problems in belief revision, Journal of Logic and Computation 13 (2003). doi:10.1093/logcom/13.1.37.
[6] A. Bundy, B. Mitrovic, Reformation: A Domain-Independent Algorithm for Theory Repair, 2016.
[7] S. Harnad, The Symbol Grounding Problem, 1990. URL: http://cogprints.org/3106/.
[8] Q. Lyu, S. Havaldar, A. Stein, L. Zhang, D. Rao, E. Wong, M. Apidianaki, C. Callison-Burch, Faithful Chain-of-Thought Reasoning, arXiv preprint arXiv:2301.13379 (2023). URL: http://arxiv.org/abs/2301.13379.
[9] J. Wei, X. Wang, D. Schuurmans, M. Bosma, B. Ichter, F. Xia, E. Chi, Q. Le, D. Zhou, Chain-of-Thought Prompting Elicits Reasoning in Large Language Models, Advances in Neural Information Processing Systems 35 (2022) 24824–24837. URL: http://arxiv.org/abs/2201.11903.
[10] T. B. Brown, B. Mann, N. Ryder, M. Subbiah, J. Kaplan, P. Dhariwal, A. Neelakantan, P. Shyam, G. Sastry, A. Askell, S. Agarwal, A. Herbert-Voss, G. Krueger, T. Henighan, R. Child, A. Ramesh, D. M. Ziegler, J. Wu, C. Winter, C. Hesse, M. Chen, E. Sigler, M. Litwin, S. Gray, B. Chess, J. Clark, C. Berner, S. McCandlish, A. Radford, I. Sutskever, D. Amodei, Language Models are Few-Shot Learners, Advances in Neural Information Processing Systems 33 (2020) 1877–1901. URL: http://arxiv.org/abs/2005.14165.
[11] F. Petroni, T. Rocktäschel, P. Lewis, A. Bakhtin, Y. Wu, A. H. Miller, S. Riedel, Language Models as Knowledge Bases?, arXiv preprint arXiv:1909.01066 (2019). URL: https://github.com/pytorch/fairseq.
[12] A. Roberts, C. Raffel, N. Shazeer, How Much Knowledge Can You Pack Into the Parameters of a Language Model?, in: Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing (EMNLP), 2020, pp. 5418–5426. URL: https://arxiv.org/abs/2002.08910v4. doi:10.18653/v1/2020.emnlp-main.437.
[13] S. Ceri, G. Gottlob, L. Tanca, What you always wanted to know about Datalog (and never dared to ask), IEEE Transactions on Knowledge and Data Engineering 1 (1989) 146–166. doi:10.1109/69.43410.
[14] A. Mpagouli, Converting First Order Logic into Natural Language: A First Level Approach, in: Current Trends in Informatics: 11th Panhellenic Conference on Informatics, PCI, 2007, pp. 517–526.
[15] T. Zhang, V. Kishore, F. Wu, K. Q. Weinberger, Y. Artzi, BERTScore: Evaluating Text Generation with BERT, arXiv preprint arXiv:1904.09675 (2019). URL: https://github.com/Tiiiger/bert.
[16] OpenAI, GPT-4 Technical Report, arXiv preprint arXiv:2303.08774 (2023). URL: http://arxiv.org/abs/2303.08774.
[17] S. Han, H. Schoelkopf, Y. Zhao, Z. Qi, M. Riddell, L. Benson, L. Sun, E. Zubova, Y. Qiao, M. Burtell, D. Peng, J. Fan, Y. Liu, B. Wong, M. Sailor, A. Ni, L. Nan, J. Kasai, T. Yu, R. Zhang, S. Joty, A. R. Fabbri, W. Kryscinski, X. V. Lin, C. Xiong, D. Radev, FOLIO: Natural Language Reasoning with First-Order Logic, arXiv preprint arXiv:2209.00840 (2022). URL: http://arxiv.org/abs/2209.00840.
[18] J. Tian, Y. Li, W. Chen, L. Xiao, H. He, Y. Jin, Diagnosing the First-Order Logical Reasoning Ability Through LogicNLI, in: Proceedings of the 2021 Conference on Empirical Methods in Natural Language Processing, 2021, pp. 3738–3747.
[19] T. C. Ferreira, D. Moussallem, S. Wubben, E. Krahmer, Enriching the WebNLG corpus, in: Proceedings of the 11th International Conference on Natural Language Generation, 2018, pp. 171–176. URL: http://data.statmt.org/wmt17_systems.
[20] L. Nan, D. Radev, R. Zhang, A. Rau, A. Sivaprasad, C. Hsieh, X. Tang, A. Vyas, N. Verma, P. Krishna, Y. Liu, N. Irwanto, J. Pan, F. Rahman, A. Zaidi, M. Mutuma, Y. Tarabar, A. Gupta, T. Yu, Y. C. Tan, X. V. Lin, C. Xiong, R. Socher, N. F. Rajani, DART: Open-Domain Structured Data Record to Text Generation, in: Proceedings of the 2021 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Association for Computational Linguistics, Online, 2021, pp. 432–447. URL: https://aclanthology.org/2021.naacl-main.37. doi:10.18653/v1/2021.naacl-main.37.
[21] J. Risch, T. Möller, J. Gutsch, M. Pietsch, Semantic Answer Similarity for Evaluating Question Answering Models, arXiv preprint arXiv:2108.06130 (2021). URL: http://arxiv.org/abs/2108.06130.
[22] C. Raffel, N. Shazeer, A. Roberts, K. Lee, S. Narang, M. Matena, Y. Zhou, W. Li, P. J. Liu, Exploring the Limits of Transfer Learning with a Unified Text-to-Text Transformer, Journal of Machine Learning Research 21 (2020) 1–67. URL: http://jmlr.org/papers/v21/20-074.html.
[23] X. Geng, H. Liu, OpenLLaMA: An Open Reproduction of LLaMA, 2023. URL: https://github.com/openlm-research/open_llama.
[24] M. Conover, M. Hayes, A. Mathur, X. Meng, J. Xie, J. Wan, S. Shah, A. Ghodsi, P. Wendell, M. Zaharia, R. Xin, Free Dolly: Introducing the World's First Truly Open Instruction-Tuned LLM, 2023. URL: https://www.databricks.com/blog/2023/04/12/dolly-first-open-commercially-viable-instruction-tuned-llm.

A. Interpretation of a proposition in natural language

pred(Const)
  Verb: “Const pred.”    Others: “Const is pred.”
  Example: fly(bird) → “bird fly.”

pred(Const1, Const2)
  Verb: “Const1 pred Const2.”    Others: “Const1 is pred of Const2.”
  Example: capitalOf(london, england) → “london is capital of of england.”

pred(Const1, Const2, Const3)
  Verb: “Const1 pred (Const3) Const2.”    Others: “Const1 is pred (Const3) of Const2.”
  Example: mother(diana, william, dummyNormal) → “diana is mother (kind) of william.”

Table 2: Interpretations of propositions in natural language, based on the part of speech of the predicate's first word, where constants can be replaced by variables. Dummy items are renamed with their types, and grammar mistakes are ignored.

B. Example of the Interplay of Repair and Grounding

Example 3 highlights how the repair and grounding processes collectively address conflicts and lead to a desirable theory.
Example 3: Repair and Grounding a Capital Theory

capitalOf(X, Y) ∧ capitalOf(Z, Y) ⟹ X = Z
⟹ capitalOf(edinburgh, england)
⟹ capitalOf(glasgow, scotland)
⟹ capitalOf(london, england)

𝒯(ℙ𝕊) = ∅, ℱ(ℙ𝕊) = {edinburgh = london, london = edinburgh, glasgow = edinburgh, glasgow = london, edinburgh = glasgow, london = glasgow}

Step 1: ABC finds there is a fault in England having two capitals, so it suggests replacing “england” with “dummyEngland1” in capitalOf(edinburgh, england), which is then grounded as “scotland”:

capitalOf(X, Y) ∧ capitalOf(Z, Y) ⟹ X = Z
⟹ capitalOf(edinburgh, scotland)
⟹ capitalOf(glasgow, scotland)
⟹ capitalOf(london, england)

Step 2: ABC finds there is a fault in Scotland having two capitals, so it suggests replacing “capitalOf” with “dummyPred” in capitalOf(glasgow, scotland), which is then grounded as “cityOf”:

capitalOf(X, Y) ∧ capitalOf(Z, Y) ⟹ X = Z
⟹ capitalOf(edinburgh, scotland)
⟹ cityOf(glasgow, scotland)
⟹ capitalOf(london, england)

C. Examples of Grounding Results

We compare the grounding performance of two models with and without background information in Example 4, which illustrates that grounding is more reasonable when background information is provided.

Example 4: A comparison of grounding answers of Example 2

• Without extra content: What is a possible entity, such that opus is broken wing of the entity? Answer within 5 words.
  – GPT-3.5 Turbo: Defective wing.
  – T5 Large (NQ): Feathers are
• With background information: Given that opus is super penguin. What is a possible entity, such that opus is broken wing of the entity? Answer within 5 words.
  – GPT-3.5 Turbo: X is injured
  – T5 Large (NQ): Cannot fly
(Note that the name of the penguin Opus is not capitalized in the prompt.)

We provide Example 5 as a comparison of grounding performance across different LLMs, in which the answers are more accurate with larger model sizes. Despite our explicit instruction to answer within five words and our setting the maximum number of output tokens to five, Dolly 2.0 and OpenLLaMA still show a high tendency to return a complete sentence.

Example 5: A comparison of grounding answers for a repaired Capital Theory

Question: What is a possible entity such that edinburgh is cap of of the entity? Answer within 5 words.

• Dolly 2.0 3B: Edinburgh is the capital of
• Dolly 2.0 7B: The answer is the Edinburgh
• OpenLLaMA 3B: edinburgh is cap of
• OpenLLaMA 7B: The answer is Scotland.
• GPT-3.5 Turbo & T5 XL (NQ): Scotland
• GPT-4: Scotland or United Kingdom.
• T5 Small & Large: Edinburgh
• T5 Small (NQ): other social entity
• T5 Large (NQ): Kingdom of Scotland

We also experimented with the effect of the number of groundings generated by the LLM. The interest of this experiment lies in the diversity of the returned results. With open-ended questions like the one in Example 6, the returned answers reflect the multifaceted nature of potential responses. An increased number of returned answers not only aids in identifying high-quality groundings but also helps users brainstorm a broader spectrum of possible groundings.

Example 6: A comparison of answers to a prompt from a repaired Tweety Theory

Question: Given that tweety is penguin. What is a possible entity such that polly is bird of the entity, and In a FOL expression, if x is bird of the entity, then x is fly? Answer within 5 words.
• GPT-3.5 Turbo: “Airplane.”, “Flying creature like parrot”, “Fish”
• GPT-4: “Sky or Air could be”, “Possible entity is ’the”, “The possible entity: magical”
• Suggested Answer: “flying”
(Note that the names of the penguins Tweety and Polly are not capitalized in the prompt.)