<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta />
    <article-meta>
      <title-group>
        <article-title>with Large Language Models - Initial Experiments and Future Directions</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Grigoris Antoniou</string-name>
          <email>G.Antoniou@hud.ac.uk</email>
          <xref ref-type="aff" rid="aff0">0</xref>
          <xref ref-type="aff" rid="aff2">2</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Sotiris Batsakis</string-name>
          <email>sbatsakis@tuc.gr</email>
          <xref ref-type="aff" rid="aff1">1</xref>
          <xref ref-type="aff" rid="aff2">2</xref>
        </contrib>
        <contrib contrib-type="editor">
          <string-name>Defeasible Reasoning, Large Language Models, Non-monotonic Reasoning</string-name>
        </contrib>
        <aff id="aff0">
          <label>0</label>
          <institution>L3S, Leibniz University Hanover</institution>
          ,
          <country country="DE">Germany</country>
        </aff>
        <aff id="aff1">
          <label>1</label>
          <institution>Technical University of Crete</institution>
          ,
          <country country="GR">Greece</country>
        </aff>
        <aff id="aff2">
          <label>2</label>
          <institution>University of Huddersfield</institution>
          ,
          <country country="UK">UK</country>
        </aff>
      </contrib-group>
      <pub-date>
        <year>2023</year>
      </pub-date>
      <abstract>
        <p>As Large Language Models gain prominence in the AI landscape, it is essential to understand their capabilities and limitations, including those related to reasoning. This paper is a first step towards understanding their capabilities in terms of defeasible rule-based reasoning. It presents the results of initial experiments and discusses future research directions.</p>
        <p>RuleML+RR'23: 17th International Rule Challenge and 7th Doctoral Consortium, September 18-20, 2023, Oslo, Norway. CEUR Workshop Proceedings (CEUR-WS.org), http://ceur-ws.org, ISSN 1613-0073.</p>
      </abstract>
      <kwd-group>
        <kwd>Defeasible Reasoning</kwd>
        <kwd>Large Language Models</kwd>
        <kwd>Non-monotonic Reasoning</kwd>
      </kwd-group>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>1. Introduction</title>
      <p>Large Language Models (LLMs) have recently risen in prominence due to their exceptional performance in various language-related tasks, since they are the underlying technology behind chatbots such as ChatGPT. LLMs such as LaMDA [1] and GPT [2] are built by training deep neural networks with billions of parameters over huge lexical datasets, often employing human judgment in a semi-supervised (e.g., reinforcement learning) training setting [3, 4]. The exceptional, human-level performance of LLMs in various tasks has led to a widespread discussion about the potential benefits and dangers of such technologies in various areas and in human society in general, including petitions to pause research on more capable LLMs [5].</p>
      <p>Despite the impressive performance of LLMs, including their ability to demonstrate emerging intelligent behaviour and reasoning capabilities, which has led some to consider them forerunners of Artificial General Intelligence [6], several issues have been identified, such as the energy cost of training LLMs [7, 8] and the difficulty of controlling their behaviour [9], ensuring conformity with stakeholders' requirements and norms, and interpreting their functionality [10]. The interpretability of LLMs is a crucial issue, since neural-network-based LLMs appear to be ‘black boxes’, in contrast to logic-based systems; although various attempts exist to deal with this problem, including the use of LLMs to interpret LLMs [11], it remains unresolved. In addition, since LLMs are trained on vast amounts of raw text, they tend to replicate their input rather than apply robust reasoning [12]. Relying on raw text instead of structured knowledge bases with machine-readable semantics contributes to the difficulty of achieving efficient reasoning, an issue examined in various works such as [13] and surveyed in [14]. Various attempts to integrate knowledge graphs (KGs) into LLMs have been proposed [15, 16] as a solution to this last issue, but recent advances in LLM capabilities, including high performance on academic and professional exams [2], illustrate the need for an updated evaluation of the reasoning capabilities of LLMs. This updated evaluation should take into account recent developments in the field, including the deployment of systems such as ChatGPT that exploit the benefits of scalability [17] and the demonstrated ability of LLMs to adjust to new tasks given just a small number of examples [18]. Furthermore, the capabilities of LLMs with respect to important formalisms such as defeasible reasoning have not yet been examined in detail.</p>
      <p>This work is an initial step towards developing a deep understanding of the reasoning capabilities of LLMs, with emphasis on nonmonotonic reasoning. To this end, we conducted experiments on nonmonotonic rule-based reasoning using ChatGPT and compared its performance with that of dedicated reasoning systems and human users. This kind of reasoning is important for cases where knowledge is incomplete and conflicting, which is the case in many application areas, including law and healthcare.</p>
    </sec>
    <sec id="sec-2">
      <title>2. Basic defeasible reasoning</title>
      <p>Defeasible reasoning is a form of nonmonotonic reasoning [19] that seeks to draw conclusions in the absence of contrary information; such conclusions may be invalidated by later additions to the knowledge base. A prototypical example concerns the flying abilities of Tweety the bird:</p>
      <p>(1) Birds typically fly. Tweety is a bird. Does Tweety fly?</p>
      <p>Unsurprisingly, ChatGPT is able to provide the correct answer and a meaningful explanation. This remains the same even if we use abstract names (A instead of bird) or imaginary species on an imaginary planet (e.g. zillos). Now let us add conflicting information.</p>
      <p>(2) Republicans are typically hawks. Quakers are typically not hawks. Nixon is both a republican and a Quaker. Is Nixon a hawk?</p>
      <p>Again, ChatGPT responds correctly and with a reasonable explanation. Now let’s look at a logically equivalent question, but one which has an implicit specificity derived from other commonsense knowledge.</p>
      <p>(3) Birds are typically flying animals. Penguins are typically not flying animals. Tweety is both a bird and a penguin. Is Tweety a flying animal?</p>
      <p>Not surprisingly, ChatGPT makes use of background knowledge to say that penguins are flightless birds. Even when asked to reply based solely on the information presented, it refers to penguins as a subclass of birds. So the response is correct but the explanation is flawed. This observation is consistent with what has already been observed by other researchers; for example, [13] established that language models do not learn to reason per se but learn statistical patterns inherent in logical problems.</p>
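      <p>The behaviour expected on examples (1) and (2) can be illustrated with a minimal sketch. It assumes a skeptical reading without priorities between rules, under which a conflict such as the Nixon diamond is left undecided; the names <monospace>Rule</monospace> and <monospace>conclude</monospace> are hypothetical and were not part of the experiments reported here.</p>
      <preformatted>
```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Rule:
    """A defeasible rule: members of `body` are typically (not) `head`."""
    body: str
    head: str
    positive: bool = True


def conclude(facts, rules, query):
    """Return 'yes', 'no' or 'undecided' for `query` (assumption: no rule priorities)."""
    # Keep only the rules about `query` whose body holds for the individual.
    applicable = [r for r in rules if r.head == query and r.body in facts]
    supports = any(r.positive for r in applicable)
    attacks = any(not r.positive for r in applicable)
    if supports and not attacks:
        return "yes"
    if attacks and not supports:
        return "no"
    # Either nothing applies or the applicable rules conflict.
    return "undecided"


# Example (1): Birds typically fly. Tweety is a bird.
tweety = conclude({"bird"}, [Rule("bird", "flies")], "flies")

# Example (2): the Nixon diamond, with no priority between the two rules.
nixon_rules = [Rule("republican", "hawk"), Rule("quaker", "hawk", positive=False)]
nixon = conclude({"republican", "quaker"}, nixon_rules, "hawk")

print(tweety, nixon)  # prints "yes undecided"
```
      </preformatted>
      <p>Example (3) has the same shape as example (2) in this encoding, which is why any preference for the penguin rule has to come from specificity or from background knowledge rather than from the statements themselves.</p>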
    </sec>
    <sec id="sec-3">
      <title>3. Reasoning with exceptions</title>
      <p>One use of defeasible reasoning is reasoning with rules and exceptions, where an exception is seen as more reliable than the general rule for the cases it applies to. We used an example from the domain of law, where rules with exceptions prevail [20], specifically the use case presented in [21]:</p>
      <p>User: So the setting is this: Suppose that a Privacy Act contains the following norms: Section 1. The collection of personal information is forbidden, unless acting on a court order authorising it. Section 2. The destruction of illegally collected personal information before accessing it is a defence against the illegal collection of the personal information. Section 3. The collection of medical information is forbidden, unless the entity collecting the medical information is permitted to collect personal information. Suppose an entity, subject to the Act, collects some personal information without being permitted to do so; at the same time they collect medical information. The entity recognises that they illegally collected personal information (i.e., they collected the information without being authorised to do so by a Court Order) and decides to remediate the illegal collection by destroying the information before accessing it. Is the entity compliant with the Privacy Act above?</p>
      <p>ChatGPT: Based on the information you’ve provided, the entity in question collected personal information without being permitted to do so and also collected medical information. According to Section 3 of the Privacy Act, the collection of medical information is forbidden unless the entity collecting it is permitted to collect personal information.</p>
      <p>In addition, according to Section 1 of the Privacy Act, the collection of personal information is
forbidden unless authorized by a court order. Therefore, the entity in question would be in violation
of the Privacy Act for both the collection of personal information without a court order and the
collection of medical information without being authorized to collect personal information.</p>
      <p>While the entity’s decision to destroy the illegally collected personal information before accessing
it may be seen as a defence under Section 2 of the Privacy Act, it’s important to note that this
defence is only available if the collection of the personal information was illegal in the first place.
In this case, the collection of personal information was indeed illegal, so the defence would apply.
However, the entity would still be in violation of the Privacy Act for the collection of medical
information without being permitted to do so. Therefore, the entity would not be fully compliant
with the Privacy Act in this scenario.</p>
      <p>This is an impressive reply from the LLM! It provides both a correct response and an
informative analysis of the situation and justification for its conclusion.</p>
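      <p>The reading of the three sections that underlies this analysis can be sketched as a general rule with an exception plus a dependent rule. This is a minimal illustration under our own simplifying assumptions, not the encoding used in [21]: being permitted to collect personal information is equated with holding a court order, and the function and parameter names are hypothetical.</p>
      <preformatted>
```python
def violated_sections(court_order, collected_personal,
                      destroyed_before_access, collected_medical):
    """Return the sections of the Privacy Act that remain violated."""
    violations = []
    # Section 1: collecting personal information is forbidden
    # unless a court order authorises it.
    illegal_personal = collected_personal and not court_order
    # Section 2 (exception): destroying the information before
    # accessing it is a defence against the illegal collection.
    if illegal_personal and not destroyed_before_access:
        violations.append("Section 1")
    # Section 3: collecting medical information is forbidden unless the
    # entity is permitted to collect personal information
    # (assumption: permitted means holding a court order).
    if collected_medical and not court_order:
        violations.append("Section 3")
    return violations


# The scenario put to ChatGPT: no court order, both kinds of information
# collected, personal information destroyed before being accessed.
scenario = violated_sections(court_order=False, collected_personal=True,
                             destroyed_before_access=True, collected_medical=True)
print(scenario)  # prints ['Section 3']
```
      </preformatted>
      <p>Under these assumptions the defence of Section 2 removes the violation of Section 1 but leaves the violation of Section 3 in place, which matches the conclusion reached in the reply above.</p>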
    </sec>
    <sec id="sec-4">
      <title>4. Rule strength and specificity</title>
      <p>Next we consider examples involving strict and defeasible rules and conflicts, where specificity may play a role. Given the already identified affinity of ChatGPT for using background knowledge when predicates and atoms are real-world entities, we use imaginary names of species on an imaginary planet, following [22]. This way, we test pure reasoning capabilities that may be indicative of what to expect in completely new situations.</p>
      <p>(5) Hittas are typically not wafs. All hittas are oxers. Oxers are usually wafs. Jukk is a hitta. Is Jukk a waf?</p>
      <p>(6) Wilfons are typically not kiglers. Wilfons are typically brindops. All brindops are kiglers. Floxxi is a wilfon. Is Floxxi a kigler?</p>
      <p>(7) Zugs are typically not vlogs. Zugs are typically storps. Storps are typically vlogs. Duss is a zug. Is Duss a vlog?</p>
      <p>(8) Humnols are typically not crerks. All posders are twerbers. Twerbers are typically crerks. Vouncy is a humnol and a posder. Is Vouncy a crerk?</p>
      <p>(9) Arkons are typically not gakks. Jaggas are typically wollers. All wollers are gakks. Fertha is an arkon and a jagga. Is Fertha a gakk?</p>
      <p>(10) Voltners are typically not zillos. Kikkas are typically crolders. Crolders are typically zillos. Grolli is a voltner and a kikka. Is Grolli a zillo?</p>
      <p>A graphical representation of questions (5)-(10) is shown in Figure 1. Three answers were possible for these questions (all of the form “Is X a Y?”): (a) it is more reasonable to conclude that X is a Y; (b) it is more reasonable to conclude that X is not a Y; (c) neither of the two answers above is reasonable. Table 1 summarises the responses to these questions as compared to the outcomes of nonmonotonic systems<sup>2, 3</sup>.</p>
      <p>The replies of ChatGPT show a general disregard of specificity. Seeking to further understand
the reason, positive and negative conclusions were inverted, but all replies were consistent
so no preference of positive over negative conclusions was observed. However, there was a
dramatic change in behavior the moment the order of statements was inverted.</p>
      <p>(11) Kikkas are typically crolders. Crolders are typically zillos. Voltners are typically not zillos. Grolli is a voltner and a kikka. Is Grolli a zillo?</p>
      <p>Now the answer b was provided! ChatGPT's explanation presented the evidence supporting this conclusion. So our initial investigation revealed serious weaknesses in terms of reasoning: the system does not seem to appreciate the need to carefully weigh the pros and cons of a conclusion, making use of implicit specificity.</p>
      <p>Most interestingly, once ChatGPT is explicitly told about specificity, it updates its result correctly. We observed this for example (5), where ChatGPT was prompted as follows: “Yes but you did not consider the fact that being a hitta is more specific to Jukk than being an oxer. Would you wish to reconsider your answer?” ChatGPT proceeded to apply specificity correctly.</p>
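      <p>To make the role of specificity in these questions concrete, the following sketch resolves a conflict between two “typically” rules in favour of the rule whose antecedent is strictly more specific. It is a deliberately simplified illustration and not one of the nonmonotonic systems used for comparison: specificity is derived only from strict “All X are Y” statements, conclusions reached through a strict rule (as in questions (6) and (9)) are not modelled, and all function names are hypothetical.</p>
      <preformatted>
```python
def strictly_entails(strict, start, goal):
    """True if `goal` is reachable from `start` via strict 'All X are Y' links only."""
    seen, frontier = {start}, [start]
    while frontier:
        current = frontier.pop()
        if current == goal:
            return True
        for parent in strict.get(current, ()):
            if parent not in seen:
                seen.add(parent)
                frontier.append(parent)
    return False


def reachable(classes, strict, typical):
    """Classes the individual belongs to, strictly or by a positive 'typically' rule."""
    reached = set(classes)
    changed = True
    while changed:
        changed = False
        derived = {p for c in reached for p in strict.get(c, ())}
        derived |= {head for body, head, positive in typical
                    if positive and body in reached}
        if not derived.issubset(reached):
            reached |= derived
            changed = True
    return reached


def beats(strict, winners, losers):
    """Every losing rule body is strictly more general than some winning rule body."""
    return all(
        any(w != loser and strictly_entails(strict, w, loser) for w in winners)
        for loser in losers
    )


def answer(classes, strict, typical, query):
    """Return 'a' (X is a Y), 'b' (X is not a Y) or 'c' (neither)."""
    member_of = reachable(classes, strict, typical)
    pro = [b for b, h, positive in typical
           if h == query and positive and b in member_of]
    con = [b for b, h, positive in typical
           if h == query and not positive and b in member_of]
    pro_wins = bool(pro) and beats(strict, pro, con)
    con_wins = bool(con) and beats(strict, con, pro)
    if pro_wins != con_wins:
        return "a" if pro_wins else "b"
    return "c"


# Question (5): hitta is strictly more specific than oxer.
q5 = answer({"hitta"}, {"hitta": ["oxer"]},
            [("hitta", "waf", False), ("oxer", "waf", True)], "waf")

# Questions (10) and (11): the same rules in a different order.
rules10 = [("voltner", "zillo", False), ("kikka", "crolder", True),
           ("crolder", "zillo", True)]
rules11 = [rules10[1], rules10[2], rules10[0]]
q10 = answer({"voltner", "kikka"}, {}, rules10, "zillo")
q11 = answer({"voltner", "kikka"}, {}, rules11, "zillo")

print(q5, q10, q11)
```
      </preformatted>
      <p>For question (5) the sketch prefers the rule about hittas and returns answer b, in line with the specificity argument given to ChatGPT above. For questions (10) and (11) it returns the same answer regardless of the order of the statements, since the order plays no part in the comparison.</p>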
    </sec>
    <sec id="sec-5">
      <title>5. Human reasoning</title>
      <p>It is interesting to contrast the performance of LLMs not only with theories of defeasible reasoning but also with the way humans actually reason. Although nonmonotonic reasoning was meant to enable “jumping to conclusions”, the formal theories developed are very complex, and humans actually find this kind of reasoning challenging, as research into human cognition has established. Examples (5)-(10) were taken from [22], where the replies of human participants were collected and analysed in depth. The following table contrasts the answers of nonmonotonic systems, ChatGPT and human participants.</p>
      <p>The results demonstrate that the discrepancy between human reasoning and the outcomes postulated by KR is similar to the discrepancy between ChatGPT and KR. As we move to better understand the defeasible reasoning capabilities of LLMs, it is very interesting to work out the relative strengths and weaknesses of KR approaches, human cognition and LLMs. A starting point could be existing works looking at the relation between KR approaches and human reasoning, e.g. [23, 24, 22, 25].</p>
      <p><sup>2</sup>Nonmonotonic systems like defeasible logics or default logic with priorities representing the concept of specificity.</p>
      <p><sup>3</sup>The input for question (5) was as follows: “I will be talking about an imaginary planet where imaginary species live. These species have strange-looking names. Now suppose you are told the following information: Hittas are typically not wafs. All hittas are oxers. Oxers are usually wafs. Jukk is a hitta. Is it (a) more reasonable or likely to conclude that Jukk is a waf than not? or (b) more reasonable or likely to conclude that Jukk is not a waf than the opposite, or (c) is it not reasonable or likely to conclude any of the replies (1) and (2) based on the information provided?”</p>
    </sec>
    <sec id="sec-6">
      <title>6. Topics for future research</title>
      <p>This is a first step towards our goal of understanding the reasoning capabilities of Large Language
Models. In this paper we focused on defeasible reasoning and conducted some experiments with
ChatGPT. The results show a mixed picture: basic principles are respected, explanation facilities
are superb and performance on real-world entities is good. On the negative side, we observed
varying behavior based on the names of predicates (real world or artificial) and reliance on
syntactic aspects. Clearly, there is a lot of research to be done. In the following, we highlight a
number of research directions.</p>
      <p>Prompt engineering. We observed that using abstract or real-world names has an effect on the outcomes of the reasoning, as does the order of statements. Additionally, pointing to the principle of specificity produced the desired change in outcome. All this points to the fact that we need a better understanding of how to interact with LLMs, an observation that has been made in many other areas of LLM use.</p>
      <p>KR vs human cognition vs LLMs. Knowledge representation sets principles of sound reasoning, but humans reason in different ways; indeed, there is an increasing body of work seeking to understand the differences between logical reasoning and human reasoning. LLMs bring an additional dimension, and it is interesting to contrast the three approaches. Apart from this being a pure research question, we can hope to gain insights into how LLMs and humans can be taught reasoning, and into how conversational interaction about reasoning outcomes with LLMs and humans can take place.</p>
      <p>Benchmarking reasoning capabilities of LLMs. Overall, there is a need for a systematic
analysis of reasoning schemes, chains of reasoning etc. Developing relevant benchmarks is
an important vehicle to this end and we intend to work on it. Our initial work reported here
is about nonmonotonic rule-based reasoning but we have longer-term plans to investigate
and benchmark other forms of reasoning, including first order reasoning, epistemic reasoning,
reasoning about change, reasoning about action and reasoning about time.</p>
      <p>Acknowledgments</p>
      <p>This research was funded by the Federal Ministry of Education and Research (BMBF), Germany under the project LeibnizKILabor with grant No. 01DD20003.</p>
      <p>[14] J. Huang, K. C.-C. Chang, Towards reasoning in large language models: A survey, arXiv preprint arXiv:2212.10403 (2022).</p>
      <p>[15] C. Zhen, Y. Shang, X. Liu, Y. Li, Y. Chen, D. Zhang, A survey on knowledge-enhanced pre-trained language models, arXiv preprint arXiv:2212.13428 (2022).</p>
      <p>[16] D. Yin, L. Dong, H. Cheng, X. Liu, K.-W. Chang, F. Wei, J. Gao, A survey of knowledge-intensive NLP with pre-trained language models, arXiv preprint arXiv:2202.08772 (2022).</p>
      <p>[17] J. Kaplan, S. McCandlish, T. Henighan, T. B. Brown, B. Chess, R. Child, S. Gray, A. Radford, J. Wu, D. Amodei, Scaling laws for neural language models, 2020. arXiv:2001.08361.</p>
      <p>[18] T. B. Brown, B. Mann, N. Ryder, M. Subbiah, J. Kaplan, P. Dhariwal, A. Neelakantan, P. Shyam, G. Sastry, A. Askell, S. Agarwal, A. Herbert-Voss, G. Krueger, T. Henighan, R. Child, A. Ramesh, D. M. Ziegler, J. Wu, C. Winter, C. Hesse, M. Chen, E. Sigler, M. Litwin, S. Gray, B. Chess, J. Clark, C. Berner, S. McCandlish, A. Radford, I. Sutskever, D. Amodei, Language models are few-shot learners, 2020. arXiv:2005.14165.</p>
      <p>[19] G. Antoniou, M.-A. Williams, Nonmonotonic reasoning, MIT Press, 1997.</p>
      <p>[20] G. Governatori, M. Hashmi, No time for compliance, in: 2015 IEEE 19th International Enterprise Distributed Object Computing Conference, IEEE, 2015, pp. 9–18.</p>
      <p>[21] G. Governatori, The Regorous approach to process compliance, in: 2015 IEEE 19th International Enterprise Distributed Object Computing Workshop, IEEE, 2015, pp. 33–40.</p>
      <p>[22] M. Ford, D. Billington, Strategies in human nonmonotonic reasoning, Computational Intelligence 16 (2000) 446–468.</p>
      <p>[23] C. K. Baker, C. Denny, P. Freund, T. Meyer, Cognitive defeasible reasoning: the extent to which forms of defeasible reasoning correspond with human reasoning, in: Artificial Intelligence Research: First Southern African Conference for AI Research, SACAIR 2020, Muldersdrift, South Africa, February 22-26, 2021, Proceedings 1, Springer, 2020, pp. 199–219.</p>
      <p>[24] M. Ford, Human nonmonotonic reasoning: The importance of seeing the logical strength of arguments, Synthese 146 (2005) 71–92.</p>
      <p>[25] M. Ragni, C. Eichhorn, T. Bock, G. Kern-Isberner, A. P. P. Tse, Formal nonmonotonic theories and properties of human defeasible reasoning, Minds and Machines 27 (2017) 79–117.</p>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>
          [1]
          <string-name><given-names>R.</given-names> <surname>Thoppilan</surname></string-name>,
          <string-name><given-names>D.</given-names> <surname>De Freitas</surname></string-name>,
          <string-name><given-names>J.</given-names> <surname>Hall</surname></string-name>,
          <string-name><given-names>N.</given-names> <surname>Shazeer</surname></string-name>,
          <string-name><given-names>A.</given-names> <surname>Kulshreshtha</surname></string-name>,
          <string-name><given-names>H.-T.</given-names> <surname>Cheng</surname></string-name>,
          <string-name><given-names>A.</given-names> <surname>Jin</surname></string-name>,
          <string-name><given-names>T.</given-names> <surname>Bos</surname></string-name>,
          <string-name><given-names>L.</given-names> <surname>Baker</surname></string-name>,
          <string-name><given-names>Y.</given-names> <surname>Du</surname></string-name>,
          et al.,
          <article-title>LaMDA: Language models for dialog applications</article-title>,
          <source>arXiv preprint arXiv:2201.08239</source>
          (<year>2022</year>).
        </mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>
          [2] OpenAI,
          <article-title>GPT-4 technical report</article-title>,
          <year>2023</year>. arXiv:2303.08774.
        </mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>
          [3]
          <string-name><given-names>N.</given-names> <surname>Lambert</surname></string-name>,
          <string-name><given-names>L.</given-names> <surname>Castricato</surname></string-name>,
          <string-name><given-names>L.</given-names> <surname>von Werra</surname></string-name>,
          <string-name><given-names>A.</given-names> <surname>Havrilla</surname></string-name>,
          <article-title>Illustrating reinforcement learning from human feedback (RLHF)</article-title>,
          <source>Hugging Face Blog</source>
          (<year>2022</year>). https://huggingface.co/blog/rlhf.
        </mixed-citation>
      </ref>
      <ref id="ref4">
        <mixed-citation>
          [4]
          <string-name>
            <given-names>L.</given-names>
            <surname>Ouyang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Wu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>X.</given-names>
            <surname>Jiang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D.</given-names>
            <surname>Almeida</surname>
          </string-name>
          ,
          <string-name>
            <given-names>C. L.</given-names>
            <surname>Wainwright</surname>
          </string-name>
          ,
          <string-name>
            <given-names>P.</given-names>
            <surname>Mishkin</surname>
          </string-name>
          ,
          <string-name>
            <given-names>C.</given-names>
            <surname>Zhang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Agarwal</surname>
          </string-name>
          ,
          <string-name>
            <given-names>K.</given-names>
            <surname>Slama</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Ray</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Schulman</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Hilton</surname>
          </string-name>
          ,
          <string-name>
            <given-names>F.</given-names>
            <surname>Kelton</surname>
          </string-name>
          ,
          <string-name>
            <given-names>L.</given-names>
            <surname>Miller</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Simens</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Askell</surname>
          </string-name>
          ,
          <string-name>
            <given-names>P.</given-names>
            <surname>Welinder</surname>
          </string-name>
          ,
          <string-name>
            <given-names>P.</given-names>
            <surname>Christiano</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Leike</surname>
          </string-name>
          ,
          <string-name>
            <given-names>R.</given-names>
            <surname>Lowe</surname>
          </string-name>
          ,
          <article-title>Training language models to follow instructions with human feedback</article-title>
          ,
          <year>2022</year>
          . arXiv:
          <volume>2203</volume>
          .
          <fpage>02155</fpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref5">
        <mixed-citation>
          [5]
          <string-name><given-names>F. O.</given-names> <surname>Letters</surname></string-name>,
          <article-title>Pause giant AI experiments: An open letter</article-title>,
          Future of Life Institute. https://futureoflife.org/open-letter/pause-giant-ai-experiments
          (<year>2023</year>).
        </mixed-citation>
      </ref>
      <ref id="ref6">
        <mixed-citation>
          [6]
          <string-name><given-names>S.</given-names> <surname>Bubeck</surname></string-name>,
          <string-name><given-names>V.</given-names> <surname>Chandrasekaran</surname></string-name>,
          <string-name><given-names>R.</given-names> <surname>Eldan</surname></string-name>,
          <string-name><given-names>J.</given-names> <surname>Gehrke</surname></string-name>,
          <string-name><given-names>E.</given-names> <surname>Horvitz</surname></string-name>,
          <string-name><given-names>E.</given-names> <surname>Kamar</surname></string-name>,
          <string-name><given-names>P.</given-names> <surname>Lee</surname></string-name>,
          <string-name><given-names>Y. T.</given-names> <surname>Lee</surname></string-name>,
          <string-name><given-names>Y.</given-names> <surname>Li</surname></string-name>,
          <string-name><given-names>S.</given-names> <surname>Lundberg</surname></string-name>,
          et al.,
          <article-title>Sparks of artificial general intelligence: Early experiments with GPT-4</article-title>,
          <source>arXiv preprint arXiv:2303.12712</source>
          (<year>2023</year>).
        </mixed-citation>
      </ref>
      <ref id="ref7">
        <mixed-citation>
          [7]
          <string-name><given-names>A. S.</given-names> <surname>Luccioni</surname></string-name>,
          <string-name><given-names>S.</given-names> <surname>Viguier</surname></string-name>,
          <string-name><given-names>A.-L.</given-names> <surname>Ligozat</surname></string-name>,
          <article-title>Estimating the carbon footprint of BLOOM, a 176B parameter language model</article-title>,
          <source>arXiv preprint arXiv:2211.02001</source>
          (<year>2022</year>).
        </mixed-citation>
      </ref>
      <ref id="ref8">
        <mixed-citation>
          [8]
          <string-name>
            <given-names>E.</given-names>
            <surname>Strubell</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Ganesh</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>McCallum</surname>
          </string-name>
          ,
          <article-title>Energy and policy considerations for deep learning in NLP</article-title>
          ,
          <source>in: Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics</source>
          ,
          <year>2019</year>
          , pp.
          <fpage>3645</fpage>
          -
          <lpage>3650</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref9">
        <mixed-citation>
          [9]
          <string-name><given-names>A.</given-names> <surname>Luccioni</surname></string-name>,
          <string-name><given-names>J.</given-names> <surname>Viviano</surname></string-name>,
          <article-title>What's in the box? An analysis of undesirable content in the Common Crawl corpus</article-title>,
          <source>in: Proceedings of the 59th Annual Meeting of the Association for Computational Linguistics and the 11th International Joint Conference on Natural Language Processing (Volume 2: Short Papers)</source>,
          <year>2021</year>, pp.
          <fpage>182</fpage>-<lpage>189</lpage>.
        </mixed-citation>
      </ref>
      <ref id="ref10">
        <mixed-citation>
          [10]
          <string-name>
            <given-names>S. R.</given-names>
            <surname>Bowman</surname>
          </string-name>
          ,
          <article-title>Eight things to know about large language models</article-title>
          ,
          <source>arXiv preprint arXiv:2304.00612</source>
          (
          <year>2023</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref11">
        <mixed-citation>
          [11]
          <string-name>
            <given-names>S.</given-names>
            <surname>Bills</surname>
          </string-name>
          ,
          <string-name>
            <given-names>N.</given-names>
            <surname>Cammarata</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D.</given-names>
            <surname>Mossing</surname>
          </string-name>
          ,
          <string-name>
            <given-names>H.</given-names>
            <surname>Tillman</surname>
          </string-name>
          ,
          <string-name>
            <given-names>L.</given-names>
            <surname>Gao</surname>
          </string-name>
          ,
          <string-name>
            <given-names>G.</given-names>
            <surname>Goh</surname>
          </string-name>
          ,
          <string-name>
            <given-names>I.</given-names>
            <surname>Sutskever</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Leike</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Wu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>W.</given-names>
            <surname>Saunders</surname>
          </string-name>
          ,
          <article-title>Language models can explain neurons in language models</article-title>
          , https://openaipublic.blob.core.windows.net/neuron-explainer/paper/index.html,
          <year>2023</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref12">
        <mixed-citation>
          [12]
          <string-name>
            <given-names>E. M.</given-names>
            <surname>Bender</surname>
          </string-name>
          ,
          <string-name>
            <given-names>T.</given-names>
            <surname>Gebru</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>McMillan-Major</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Shmitchell</surname>
          </string-name>
          ,
          <article-title>On the dangers of stochastic parrots: Can language models be too big?</article-title>
          ,
          <source>in: Proceedings of the 2021 ACM Conference on Fairness, Accountability, and Transparency</source>
          ,
          <year>2021</year>
          , pp.
          <fpage>610</fpage>
          -
          <lpage>623</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref13">
        <mixed-citation>
          [13]
          <string-name>
            <given-names>H.</given-names>
            <surname>Zhang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>L. H.</given-names>
            <surname>Li</surname>
          </string-name>
          ,
          <string-name>
            <given-names>T.</given-names>
            <surname>Meng</surname>
          </string-name>
          ,
          <string-name>
            <given-names>K.-W.</given-names>
            <surname>Chang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>G.</given-names>
            <surname>Van den Broeck</surname>
          </string-name>
          ,
          <article-title>On the paradox of learning to reason from data</article-title>
          ,
          <source>arXiv preprint arXiv:2205.11502</source>
          (
          <year>2022</year>
          ).
        </mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>