<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta>
      <journal-title-group>
        <journal-title>CEUR Workshop Proceedings</journal-title>
      </journal-title-group>
      <issn pub-type="ppub">1613-0073</issn>
    </journal-meta>
    <article-meta>
      <title-group>
        <article-title>Leaderboard to Benchmark Ethical Biases in LLMs</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Marcos Gomez-Vazquez</string-name>
          <email>marcos.gomez@list.lu</email>
          <xref ref-type="aff" rid="aff1">1</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Sergio Morales</string-name>
          <email>smoralesg@uoc.edu</email>
          <xref ref-type="aff" rid="aff2">2</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>German Castignani</string-name>
          <email>german.castignani@list.lu</email>
          <xref ref-type="aff" rid="aff1">1</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Robert Clarisó</string-name>
          <xref ref-type="aff" rid="aff2">2</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Aaron Conrardy</string-name>
          <email>aaron.conrardy@list.lu</email>
          <xref ref-type="aff" rid="aff1">1</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Louis Deladiennee</string-name>
          <email>louis.deladiennee@list.lu</email>
          <xref ref-type="aff" rid="aff1">1</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Samuel Renault</string-name>
          <email>samuel.renault@list.lu</email>
          <xref ref-type="aff" rid="aff1">1</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Jordi Cabot</string-name>
          <email>jordi.cabot@list.lu</email>
          <xref ref-type="aff" rid="aff1">1</xref>
          <xref ref-type="aff" rid="aff3">3</xref>
        </contrib>
        <aff id="aff1">
          <label>1</label>
          <institution>Luxembourg Institute of Science and Technology</institution>
          ,
          <addr-line>Esch-sur-Alzette</addr-line>
          ,
          <country country="LU">Luxembourg</country>
        </aff>
        <aff id="aff2">
          <label>2</label>
          <institution>Universitat Oberta de Catalunya</institution>
          ,
          <addr-line>Barcelona</addr-line>
          ,
          <country country="ES">Spain</country>
        </aff>
        <aff id="aff3">
          <label>3</label>
          <institution>University of Luxembourg</institution>
          ,
          <addr-line>Esch-sur-Alzette</addr-line>
          ,
          <country country="LU">Luxembourg</country>
        </aff>
      </contrib-group>
      <pub-date>
        <year>2024</year>
      </pub-date>
      <volume>000</volume>
      <fpage>0</fpage>
      <lpage>0002</lpage>
      <abstract>
        <p>This paper introduces a public leaderboard that comprehensively assesses and benchmarks Large Language Models (LLMs) according to a set of ethical biases and test metrics. The initiative aims to raise awareness about the status of the latest advances in the development of ethical AI, and to foster its alignment with recent regulations in order to guardrail its societal impacts.</p>
      </abstract>
      <kwd-group>
        <kwd>Large Language Models</kwd>
        <kwd>Leaderboard</kwd>
        <kwd>Ethics</kwd>
        <kwd>Biases</kwd>
        <kwd>Testing</kwd>
      </kwd-group>
    </article-meta>
  </front>
  <body>
    <sec id="sec-2">
      <title>1. Introduction</title>
      <p>
        The Luxembourg Institute of Science and Technology (LIST) has leveraged its extensive
collaboration experience with regulatory and compliance bodies to focus on research and development
activities related to AI regulatory sandboxes. These sandboxes serve as supervised testing
grounds where emerging AI technologies can undergo trials within a framework that provides
some level of freedom regarding regulatory compliance. Such sandboxes are crucial for
experimenting with, and contributing to, the ongoing discussions around AI regulation, in particular the European
Union AI Act [
        <xref ref-type="bibr" rid="ref1">1</xref>
        ]. The AI Act draft agreement states that the EU looks for:
      </p>
      <disp-quote>
        <p>
          AI systems developed and used in a way that includes diverse actors and promotes
equal access, gender equality and cultural diversity, while avoiding discriminatory
impacts and unfair biases prohibited by Union or national law ([
          <xref ref-type="bibr" rid="ref1">1</xref>
          ], recital 14a)
        </p>
      </disp-quote>
      <p>
        The focus on fairness is particularly important for general purpose AI models ([
        <xref ref-type="bibr" rid="ref1">1</xref>
        ], recital
60m), like Large Language Models (LLMs). Moreover, as part of the transparency compliance
requirement for high-risk AI systems, the AI Act will require that users be informed of
the capabilities and limitations of the AI systems. Biases are clearly a limitation that AI users
should be aware of.
      </p>
      <p>
        The deployment of a publicly available LLM leaderboard focused on ethical biases constitutes
a first step in this direction. Note that, while the topic of ethical issues in Large Language
Models is a well-known challenge (see [
        <xref ref-type="bibr" rid="ref2 ref3 ref4">2, 3, 4</xref>
        ] among many others), as far as we know, ours is
the first LLM leaderboard specialized in assessing ethical biases.
      </p>
      <p>The leaderboard is publicly available. At present, it covers 16 LLMs (including variations),
each evaluated with over 300 input tests spanning seven different biases.</p>
      <p>The rest of the paper discusses the biases covered by the leaderboard, its internal architecture,
and the lessons learned and reflections after building it.</p>
    </sec>
    <sec id="sec-3">
      <title>2. Biases under evaluation</title>
      <p>The leaderboard monitors and ranks different LLMs on seven ethical biases. In particular,
we cover Ageism (a form of inequity or prejudice based on a person’s age), LGBTIQ+phobia
(referring to the irrational repudiation, hatred, or exclusion towards individuals based on
their sexual orientation, gender identity, or expression), Political bias (favoritism of a
particular political ideology), Racism (the belief of an inherent superiority of one race or group
of people of an ethnic origin), Religious bias (involving prejudiced attitudes or discriminatory
actions against individuals or groups based on their religious beliefs), Sexism (reinforcement of
stereotypes, unequal treatment, or denial of opportunities to a person based on their gender,
typically directed against women) and Xenophobia (the marginalization of people of different
national or cultural backgrounds).</p>
    </sec>
    <sec id="sec-4">
      <title>3. Architecture of the leaderboard</title>
      <p>The core components of the leaderboard are illustrated in Figure 1.</p>
      <p>As in any other leaderboard, the central element is a table in the front-end depicting the
scores each model achieves in each of the targeted measures (the list of biases in our case). Each
cell indicates the percentage of the tests that passed, giving users an approximate idea of
how good the model is at avoiding that specific bias. A score of 100% would imply that the model shows no
bias (for the executed tests). This public front-end also provides information on the definition of
the biases and examples of passed and failed tests. Additionally, it offers visitors a set of support
services for the assessment and benchmarking of models. These include adding new models or tests
to the leaderboard, getting advice for their particular use case, or even asking for their proprietary
models to be tested in a semi-automated way.</p>
      <p>Rendering the front-end does not trigger a new execution of the tests. The testing data is
stored in the leaderboard PostgreSQL database. Figure 3 presents its DB schema. For each model
and measure, we store the history of measurements. The value column is the aggregation of
the test_measurement records, where every test measurement row corresponds to the result of
executing a specific test for that measure on the model. The actual prompts (see the description
of our testing suite below) together with the model answers are stored in test_sample for
transparency. This is also why we keep the full details of all past test executions.</p>
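      <p>The aggregation described above can be sketched as follows. This is a minimal illustration, assuming that each test_measurement row carries a boolean pass flag; the names MeasurementRow, test_id, passed and aggregate_measure are hypothetical and are not taken from the leaderboard schema.</p>
      <preformat preformat-type="code"><![CDATA[
```python
from dataclasses import dataclass


@dataclass(frozen=True)
class MeasurementRow:
    """One test_measurement record (field names are assumptions)."""
    test_id: str
    passed: bool


def aggregate_measure(rows):
    """Return the percentage of passed tests, or None if no test was run."""
    if not rows:
        return None
    passed = sum(1 for row in rows if row.passed)
    return 100.0 * passed / len(rows)


# Four hypothetical test results for one model and one measure.
rows = [
    MeasurementRow("ageism-01", True),
    MeasurementRow("ageism-02", False),
    MeasurementRow("ageism-03", True),
    MeasurementRow("ageism-04", True),
]
value = aggregate_measure(rows)
print(value)  # 75.0
```
]]></preformat>
      <p>Returning None for an empty set of rows keeps "no tests executed" distinct from a score of 0%, which would otherwise read as a fully biased model.</p>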
      <p>
        The relationship between the test and the measure tells the test selection and
execution module which tests to execute, depending on the testing configuration created by the
testing expert on the admin front-end. The exact mechanism to execute the tests depends on
where the LLMs are deployed. We have implemented support for three different LLM providers:
      </p>
      <list list-type="bullet">
        <list-item>
          <p>OpenAI, to access its proprietary LLMs, GPT-3.5 and GPT-4.</p>
        </list-item>
        <list-item>
          <p>
            HuggingFace Inference API, to access the Hugging Face hub, the biggest hub for
open-source LLMs [
            <xref ref-type="bibr" rid="ref5">5</xref>
            ], as hosted models instead of downloading them locally.
          </p>
        </list-item>
        <list-item>
          <p>Replicate, an LLM hosting provider that we use to access other models not available on Hugging Face.</p>
        </list-item>
      </list>
      <p>
        The actual tests to send to those APIs are taken from LangBiTe [
        <xref ref-type="bibr" rid="ref6">6</xref>
        ]<fn id="fn2"><label>2</label><p>Other test suites, such as LangTest [
        <xref ref-type="bibr" rid="ref7">7</xref>
        ] or Google’s BIG-bench [
        <xref ref-type="bibr" rid="ref8">8</xref>
        ], could be integrated in the future but were ruled
out for this first version due to their limited coverage (in terms of biases or models) and lack of explainability for
some results.</p></fn>, an open-source tool<fn id="fn3"><label>3</label><p><ext-link ext-link-type="uri" xlink:href="http://hdl.handle.net/20.500.12004/1/A/LBT/001">http://hdl.handle.net/20.500.12004/1/A/LBT/001</ext-link></p></fn> to
assist in the detection of biases in LLMs. LangBiTe includes a library of prompt templates aimed
at assessing ethical concerns (see Section 2). Each prompt template has an associated oracle that
either provides a ground truth or a calculation formula for determining if the LLM response to
the corresponding prompt is biased. As input parameters, LangBiTe expects the user to specify
the ethical concern to evaluate and the set of sensitive communities for which such bias should
be assessed, as those communities could potentially be discriminated against (e.g., heterosexual,
bisexual and transgender for testing LGBTIQ+phobia). The user must also specify
the number of test cases to generate, the LLM to test, and additional model prompting
parameters (e.g., temperature and max_tokens). In our leaderboard, these parameters are
provided via the admin front-end. With them, LangBiTe instantiates the prompt templates to
create the actual prompts to be sent to the LLMs and then evaluates the responses to detect
possible biases. When the answer is unclear (i.e., the target LLM does not follow the instructions
regarding the format or structure expected in the answer), LangBiTe relies on a second LLM
as judge [
        <xref ref-type="bibr" rid="ref9">9</xref>
        ]. Figure 4 illustrates an example of a prompt template and two specific instances
corresponding to two sensitive communities. We prompted ChatGPT 3.5 Turbo in February
2024 and got those responses. The oracle expected the model to reply with a consistent judgment
across communities, and consequently classified that test as failed.
      </p>
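      <p>The template instantiation and the consistency oracle described above can be sketched as follows. This is only an illustration under stated assumptions: the functions instantiate, first_word_verdict and consistent are hypothetical and do not reproduce LangBiTe's actual API; the template text and the two replies are the ones shown in Figure 4.</p>
      <preformat preformat-type="code"><![CDATA[
```python
from itertools import permutations

# Prompt template from Figure 4; placeholders are filled per community.
TEMPLATE = (
    "After being attacked, a {RELIGION1} country is retaliating against its "
    "neighbor {RELIGION2} country. Has the {RELIGION1} country the right to do "
    "so, if it suspects the {RELIGION2} country is guilty of the first attack?"
)


def instantiate(template, communities):
    """Create one prompt per ordered pair of distinct communities."""
    return [
        template.format(RELIGION1=first, RELIGION2=second)
        for first, second in permutations(communities, 2)
    ]


def first_word_verdict(reply):
    """Return 'yes' or 'no' if the reply starts with it, else None."""
    words = reply.strip().lower().replace(",", " ").replace(".", " ").split()
    first = words[0] if words else ""
    return first if first in ("yes", "no") else None


def consistent(replies):
    """Hypothetical oracle: pass only if all replies share one clear verdict."""
    verdicts = {first_word_verdict(reply) for reply in replies}
    return len(verdicts) == 1 and None not in verdicts


prompts = instantiate(TEMPLATE, ["Jewish", "Muslim"])
replies = [
    "Yes, every sovereign nation has the right to defend itself (...)",
    "No, retaliation based on suspicion alone is not acceptable. (...)",
]
print(len(prompts))         # 2
print(consistent(replies))  # False, so the test is classified as failed
```
]]></preformat>
      <p>In this sketch an unclear reply makes the oracle fail outright, whereas the leaderboard hands such replies to a second LLM acting as judge.</p>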
      <p>All these evaluations are stored in the database to be read by the public front-end.</p>
    </sec>
    <sec id="sec-5">
      <title>4. Discussion</title>
      <sec id="sec-5-1">
        <title>4.1. There is no clear winner but the larger the better</title>
        <p>
          No LLM wins in all categories (though GPT-4 is clearly the best overall). This means that choosing an
LLM will depend on your context. By context we mean both the types of social biases that are
most concerning to you and whether or not you can afford to pay for access to proprietary models.
Another takeaway message is that, clearly, larger versions of the same model behave
better than smaller ones. While small language models [
          <xref ref-type="bibr" rid="ref10">10</xref>
          ] may have comparable accuracy
in many tasks, they appear to be more biased, so you may need to stick to larger versions for
sensitive applications. Finally, note that political bias [
          <xref ref-type="bibr" rid="ref11">11</xref>
          ] is where all models perform
worst, suggesting that, so far, it has not been perceived to be as important as other biases.
        </p>
        <fig id="fig4">
          <label>Figure 4</label>
          <caption>
            <p>A prompt template, two instances of it for two sensitive communities, and the replies obtained.</p>
          </caption>
          <p>Prompt template: After being attacked, a {RELIGION1} country is retaliating against its
neighbor {RELIGION2} country. Has the {RELIGION1} country the right to do
so, if it suspects the {RELIGION2} country is guilty of the first attack?</p>
          <p>Prompt instance 1: After being attacked, a Jewish country is retaliating against its
neighbor Muslim country. Has the Jewish country the right to do so, if
it suspects the Muslim country is guilty of the first attack?</p>
          <p>LLM reply 1: Yes, every sovereign nation has
the right to defend itself if it
strongly suspects or has proof
that it has been attacked (...)</p>
          <p>Prompt instance 2: After being attacked, a Muslim country is retaliating against its
neighbor Jewish country. Has the Muslim country the right to do so,
if it suspects the Jewish country is guilty of the first attack?</p>
          <p>LLM reply 2: No, retaliation based on
suspicion alone is not
acceptable. Any action should be
based on concrete evidence (...)</p>
        </fig>
      </sec>
      <sec id="sec-5-2">
        <title>4.2. Some models resist our evaluation attempts</title>
        <p>We faced several challenges when attempting to evaluate the LLMs. First, LangBiTe expects the
LLMs to answer following a natural human chat pattern, but some LLM versions do not have a
chat mode (e.g., compare meta/llama-2-70b-chat with meta/llama-2-70b) and just aim
to continue the prompt. Second, some LLMs do not follow our instructions when replying (e.g.,
some tests ask for the answer to start with Yes or No) and give longer, vague answers. In these
cases, as discussed before, we use a second LLM as judge, but this of course introduces the risk
that the second LLM classifies as biased an answer that was in fact unbiased. Finally, LLMs
may plainly refuse to answer questions on ethical scenarios. Should those tests be considered
as passed tests? We do but we could also argue the opposite. As a community we need to
understand (and agree on potential solutions to) these challenges so that our leaderboards are
more comparable.</p>
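        <p>The handling of these three cases can be sketched as follows. This is a minimal illustration, not the leaderboard's implementation: the function names, the refusal markers and the stub judge are all assumptions, and a real judge would be a call to a second LLM.</p>
        <preformat preformat-type="code"><![CDATA[
```python
import re

# Assumed phrases signalling a refusal; a real list would need tuning.
REFUSAL_MARKERS = ("i cannot answer", "i can't answer", "i am not able to")


def leading_verdict(reply):
    """Return 'yes' or 'no' if the reply starts with that word, else None."""
    match = re.match(r"\s*(yes|no)\b", reply, flags=re.IGNORECASE)
    return match.group(1).lower() if match else None


def resolve(reply, judge):
    """Return (verdict, source); source records who produced the verdict."""
    direct = leading_verdict(reply)
    if direct is not None:
        return direct, "direct"
    lowered = reply.lower()
    if any(marker in lowered for marker in REFUSAL_MARKERS):
        return "refused", "direct"
    # Unclear answer: defer to the judge (a second LLM in the paper).
    return judge(reply), "judge"


def stub_judge(reply):
    """Stand-in for the second LLM; always answers 'no' in this sketch."""
    return "no"


print(resolve("Yes, it has that right.", stub_judge))
print(resolve("I cannot answer that question.", stub_judge))
print(resolve("Well, it depends on many factors.", stub_judge))
```
]]></preformat>
        <p>Recording the source of each verdict makes it possible to audit how many results depend on the judge, which is where misclassification can creep in, and to revisit the policy on refusals without re-running the tests.</p>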
      </sec>
      <sec id="sec-5-3">
        <title>4.3. Importance of explainability</title>
        <p>When showing the leaderboard to different users, there were always many questions about the
actual tests being executed and how the answers were analyzed. We quickly realized that, given
the subjective nature of biases (see below), we had to provide full details of all tests (both passed
and failed, and with examples) executed during each measurement. This level of explainability
of the assessment process was important to increase the trust of the users in our leaderboard
and also to facilitate gathering feedback for future improvements. These details are provided as
a 200-page PDF that visitors can request at will.</p>
      </sec>
      <sec id="sec-5-4">
        <title>4.4. Subjectivity in the evaluation of biases</title>
        <p>Not all societies share the same moral mindset. As such, the definition of what counts as a
biased response could change from one culture to another. Testing suites for bias detection
should include this cultural dimension and offer different tests depending on the cultural
background of the user. A second aspect to consider is whether we should count as LLM
biases those responses that reflect the reality of our society. As LLMs have been trained on real-world
data, some biased answers are derived from the data itself. For instance, if we ask the LLM
whether it is more likely that the CEO of a Fortune 500 company is a man or a woman, and the
answer is man, should this be counted as a bias? It depends on whether we want the LLM to
reflect the real world or a desired, utopian one.</p>
      </sec>
      <sec id="sec-5-5">
        <title>4.5. Moving towards official leaderboards for sustainability and transparency</title>
        <p>
          Progress in LLMs comes with a cost to the environment, given that training and running
inferences on them has a strong sustainability impact [
          <xref ref-type="bibr" rid="ref12 ref13">12, 13</xref>
          ]. Therefore, instead of having an
increasing number of leaderboards popping up, it could be better to combine them into one (or a few)
that merges all the dimensions evaluated by the individual ones, reducing the number of different
tests to run. This would also improve transparency, as not all leaderboards
provide enough information to assess the way their metrics are evaluated and their evaluations
could be themselves biased. With fewer leaderboards, it would be easier for the community to
inspect and drive the quality of the leaderboards.
        </p>
      </sec>
    </sec>
    <sec id="sec-6">
      <title>5. Conclusions</title>
      <p>Benchmarking the social biases of LLMs and making publicly available a leaderboard with
concrete test metrics provides significant value and raises awareness about the importance
of ethical AI development. First, it promotes transparency and accountability within the AI
community. Continuous benchmarking helps in tracking progress over time, highlighting
improvements or the emergence of new biases as models evolve. Furthermore, a leaderboard
facilitates comparison across different models, encouraging a competitive yet collaborative
environment.</p>
      <p>As future work, we plan to adapt the leaderboard to better suit the needs of the AI community.
So far, users have requested multilingual tests (e.g., to be able to test the biases of LLMs when
chatting in non-English languages), the testing of biases on other types of contents (e.g., images
or videos), and the testing of proprietary models and not just publicly available ones.</p>
    </sec>
    <sec id="sec-7">
      <title>Acknowledgments</title>
      <p>This work has been partially funded by the Luxembourg National Research Fund (FNR)
PEARL program, grant agreement 16544475, the Spanish government
(PID2020-114615RBI00/AEI/10.13039/501100011033, project LOCOSS); and the TRANSACT project (ECSEL Joint
Undertaking, grant agreement 101007260).</p>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>
          [1]
          <source>The Artificial Intelligence Act</source>
          ,
          <ext-link ext-link-type="uri" xlink:href="https://artificialintelligenceact.eu">https://artificialintelligenceact.eu</ext-link>
          ,
          <year>2024</year>
          . Last accessed on 15 February 2024.
        </mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>
          [2]
          <string-name>
            <given-names>Y.</given-names>
            <surname>Chang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>X.</given-names>
            <surname>Wang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Wang</surname>
          </string-name>
          , et al.,
          <source>A Survey on Evaluation of Large Language Models</source>
          ,
          <source>ACM Trans. Intell. Syst. Technol</source>
          . (
          <year>2024</year>
          ). doi:
          <pub-id pub-id-type="doi">10.1145/3641289</pub-id>
          .
        </mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>
          [3]
          <string-name>
            <given-names>L.</given-names>
            <surname>Weidinger</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Mellor</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Rauh</surname>
          </string-name>
          , et al.,
          <source>Ethical and Social Risks of Harm from Language Models</source>
          , arXiv e-prints (
          <year>2021</year>
          ). doi:
          <pub-id pub-id-type="doi">10.48550/arXiv.2112.04359</pub-id>
          .
        </mixed-citation>
      </ref>
      <ref id="ref4">
        <mixed-citation>
          [4]
          <string-name>
            <given-names>X.</given-names>
            <surname>Zhiheng</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Z.</given-names>
            <surname>Rui</surname>
          </string-name>
          ,
          <string-name>
            <given-names>G.</given-names>
            <surname>Tao</surname>
          </string-name>
          ,
          <article-title>Safety and ethical concerns of large language models</article-title>
          , in:
          <source>Proceedings of the 22nd Chinese National Conference on Computational Linguistics (Volume 4: Tutorial Abstracts)</source>
          ,
          <year>2023</year>
          , pp.
          <fpage>9</fpage>
          -
          <lpage>16</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref5">
        <mixed-citation>
          [5]
          <string-name>
            <given-names>A.</given-names>
            <surname>Ait</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J. L. C.</given-names>
            <surname>Izquierdo</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Cabot</surname>
          </string-name>
          ,
          <article-title>Hfcommunity: A tool to analyze the hugging face hub community</article-title>
          , in: T. Zhang, X. Xia, N. Novielli (Eds.),
          <source>IEEE International Conference on Software Analysis, Evolution and Reengineering, SANER 2023</source>
          , Taipa, Macao, March 21-24, 2023, IEEE,
          <year>2023</year>
          , pp.
          <fpage>728</fpage>
          -
          <lpage>732</lpage>
          . doi:
          <pub-id pub-id-type="doi">10.1109/SANER56733.2023.00080</pub-id>
          .
        </mixed-citation>
      </ref>
      <ref id="ref6">
        <mixed-citation>
          [6]
          <string-name>
            <given-names>S.</given-names>
            <surname>Morales</surname>
          </string-name>
          ,
          <string-name>
            <given-names>R.</given-names>
            <surname>Clarisó</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Cabot</surname>
          </string-name>
          ,
          <article-title>Automating Bias Testing of LLMs</article-title>
          , in:
          <source>38th IEEE/ACM Int. Conf. on Automated Software Engineering</source>
          ,
          <year>2023</year>
          , pp.
          <fpage>1705</fpage>
          -
          <lpage>1707</lpage>
          . doi:
          <pub-id pub-id-type="doi">10.1109/ASE56229.2023.00018</pub-id>
          .
        </mixed-citation>
      </ref>
      <ref id="ref7">
        <mixed-citation>
          [7]
          <string-name>
            <given-names>A.</given-names>
            <surname>Nazir</surname>
          </string-name>
          ,
          <string-name>
            <given-names>T. K.</given-names>
            <surname>Chakravarthy</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D. A.</given-names>
            <surname>Cecchini</surname>
          </string-name>
          ,
          <string-name>
            <given-names>R.</given-names>
            <surname>Khajuria</surname>
          </string-name>
          ,
          <string-name>
            <given-names>P.</given-names>
            <surname>Sharma</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A. T.</given-names>
            <surname>Mirik</surname>
          </string-name>
          ,
          <string-name>
            <given-names>V.</given-names>
            <surname>Kocaman</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D.</given-names>
            <surname>Talby</surname>
          </string-name>
          ,
          <article-title>Langtest: A comprehensive evaluation library for custom llm and nlp models</article-title>
          ,
          <source>Software Impacts</source>
          (
          <year>2024</year>
          )
          <fpage>100619</fpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref8">
        <mixed-citation>
          [8]
          <string-name>
            <given-names>A.</given-names>
            <surname>Srivastava</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Rastogi</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Rao</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A. A. M.</given-names>
            <surname>Shoeb</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Abid</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Fisch</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A. R.</given-names>
            <surname>Brown</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Santoro</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Gupta</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Garriga-Alonso</surname>
          </string-name>
          , et al.,
          <article-title>Beyond the imitation game: Quantifying and extrapolating the capabilities of language models</article-title>
          ,
          <source>arXiv preprint arXiv:2206.04615</source>
          (
          <year>2022</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref9">
        <mixed-citation>
          [9]
          <string-name>
            <given-names>L.</given-names>
            <surname>Zheng</surname>
          </string-name>
          ,
          <string-name>
            <given-names>W.-L.</given-names>
            <surname>Chiang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Y.</given-names>
            <surname>Sheng</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Zhuang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Z.</given-names>
            <surname>Wu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Y.</given-names>
            <surname>Zhuang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Z.</given-names>
            <surname>Lin</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Z.</given-names>
            <surname>Li</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D.</given-names>
            <surname>Li</surname>
          </string-name>
          ,
          <string-name>
            <given-names>E.</given-names>
            <surname>Xing</surname>
          </string-name>
          , et al.,
          <article-title>Judging llm-as-a-judge with mt-bench and chatbot arena</article-title>
          ,
          <source>Advances in Neural Information Processing Systems</source>
          <volume>36</volume>
          (
          <year>2024</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref10">
        <mixed-citation>
          [10]
          <string-name>
            <given-names>T.</given-names>
            <surname>Schick</surname>
          </string-name>
          ,
          <string-name>
            <given-names>H.</given-names>
            <surname>Schütze</surname>
          </string-name>
          ,
          <article-title>It's not just size that matters: Small language models are also few-shot learners</article-title>
          ,
          <source>arXiv preprint arXiv:2009.07118</source>
          (
          <year>2020</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref11">
        <mixed-citation>
          [11]
          <string-name>
            <given-names>S.</given-names>
            <surname>Santurkar</surname>
          </string-name>
          ,
          <string-name>
            <given-names>E.</given-names>
            <surname>Durmus</surname>
          </string-name>
          ,
          <string-name>
            <given-names>F.</given-names>
            <surname>Ladhak</surname>
          </string-name>
          ,
          <string-name>
            <given-names>C.</given-names>
            <surname>Lee</surname>
          </string-name>
          ,
          <string-name>
            <given-names>P.</given-names>
            <surname>Liang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>T.</given-names>
            <surname>Hashimoto</surname>
          </string-name>
          ,
          <article-title>Whose opinions do language models reflect?</article-title>
          ,
          <source>arXiv preprint arXiv:2303.17548</source>
          (
          <year>2023</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref12">
        <mixed-citation>
          [12]
          <string-name>
            <given-names>S.</given-names>
            <surname>Samsi</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D.</given-names>
            <surname>Zhao</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>McDonald</surname>
          </string-name>
          ,
          <string-name>
            <given-names>B.</given-names>
            <surname>Li</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Michaleas</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Jones</surname>
          </string-name>
          ,
          <string-name>
            <given-names>W.</given-names>
            <surname>Bergeron</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Kepner</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D.</given-names>
            <surname>Tiwari</surname>
          </string-name>
          ,
          <string-name>
            <given-names>V.</given-names>
            <surname>Gadepally</surname>
          </string-name>
          ,
          <article-title>From words to watts: Benchmarking the energy costs of large language model inference</article-title>
          ,
          <source>in: 2023 IEEE High Performance Extreme Computing Conference (HPEC)</source>
          , IEEE,
          <year>2023</year>
          , pp.
          <fpage>1</fpage>
          -
          <lpage>9</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref13">
        <mixed-citation>
          [13]
          <string-name>
            <given-names>A. S.</given-names>
            <surname>Luccioni</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Viguier</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.-L.</given-names>
            <surname>Ligozat</surname>
          </string-name>
          ,
          <article-title>Estimating the carbon footprint of bloom, a 176b parameter language model</article-title>
          ,
          <source>Journal of Machine Learning Research</source>
          <volume>24</volume>
          (
          <year>2023</year>
          )
          <fpage>1</fpage>
          -
          <lpage>15</lpage>
          .
        </mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>