<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta />
    <article-meta>
      <title-group>
        <article-title>Sustainable LLM Benchmarking: Efficient Evaluation Optimizing Costs and Resources</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Nathanaël Lemonnier</string-name>
          <email>nathanael.lemonnier@se.com</email>
          <xref ref-type="aff" rid="aff0">0</xref>
          <xref ref-type="aff" rid="aff3">3</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Jacques Kluska</string-name>
          <email>jacques.kluska@se.com</email>
          <xref ref-type="aff" rid="aff0">0</xref>
          <xref ref-type="aff" rid="aff2">2</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Vincent Morel</string-name>
          <email>vincent.morel@se.com</email>
          <xref ref-type="aff" rid="aff0">0</xref>
          <xref ref-type="aff" rid="aff1">1</xref>
        </contrib>
        <aff id="aff0">
          <label>0</label>
          <institution>2nd Workshop on Green-Aware Artificial Intelligence</institution>
          ,
          <addr-line>28th European Conference on Artificial Intelligence, ECAI 2025</addr-line>
        </aff>
        <aff id="aff1">
          <label>1</label>
          <institution>AI Technology, AI Hub</institution>
          ,
          <addr-line>Schneider Electric, 35 rue Joseph Monier, Rueil-Malmaison, 92800</addr-line>
          ,
          <country country="FR">France</country>
        </aff>
        <aff id="aff2">
          <label>2</label>
          <institution>Responsible AI, AI Hub</institution>
          ,
          <addr-line>Schneider Electric, 160 av. des Martyrs, Grenoble, 38000</addr-line>
          ,
          <country country="FR">France</country>
        </aff>
        <aff id="aff3">
          <label>3</label>
          <institution>Strategy &amp; Innovation, AI Hub</institution>
          ,
          <addr-line>Schneider Electric, 160 av. des Martyrs, Grenoble, 38000</addr-line>
          ,
          <country country="FR">France</country>
        </aff>
      </contrib-group>
      <pub-date>
        <year>2026</year>
      </pub-date>
      <abstract>
        <p>The surge and continuous delivery of novel Language Models (LMs) increase the need to compare their performance so that the most appropriate model can be selected for industrial applications. Leaderboard platforms rely on a benchmark methodology that scores models on the evaluated completions of many queries grouped under a specific domain. However, the number of queries in these datasets is often huge, mobilising time, cost, and resources. We developed a methodology to drastically reduce the volume of such datasets, with negligible loss of the baseline score. The direct implications are savings in cost, duration, and carbon emissions, combined for a sustainable and responsible approach to assessing LM performance. We used datasets from the HuggingFace OpenLLM Leaderboard v1 and tracked the performance score obtained from inferences with two models (gpt-3.5-turbo-1106 and gpt-4o-mini-2024-07-18). We used a local slope reduction method and the kneedle algorithm on lowess smoothing to determine the sufficient volume of queries and the associated loss in performance score. While most public benchmarks focus on a single metric, we incorporated four of them simultaneously (performance, cost, latency, carbon footprint). This allowed us to demonstrate that using only 40.7% of the datasets saves 26 hours of inference computation and 1.818 kg CO2eq (&gt;59% reduction), for a 0.75% relative loss on performance score precision. Limiting the number of queries used to benchmark LMs is a powerful lever to adapt LM consumption to a responsible and frugal use of AI and to gain more control over economical and ecological constraints. Our findings raise awareness of the unnecessary use of large volumes of queries in LM benchmarks, and call for a mindset shift when developing and evaluating industrial AI applications.</p>
      </abstract>
      <kwd-group>
        <kwd>Benchmark</kwd>
        <kwd>Responsible AI</kwd>
        <kwd>LLMs</kwd>
        <kwd>Sustainability</kwd>
      </kwd-group>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>1. Introduction</title>
      <p>
        A vast number of Language Models (LMs) have been released from several providers[
        <xref ref-type="bibr" rid="ref1">1</xref>
        ] since the
recent surge of AI, and recent advances allow novel technologies to be available, like Large Reasoning
Models (LRMs), Agentic AI, etc. Matching a specific use case to the most appropriate LM can be
cumbersome. Between Small LMs (SLMs), Large LMs (LLMs), LRMs, and so on, the challenge for users
will be to select the best compromise between performance, latency, and cost. Moreover, generative
AI models can have a significant energy and environmental footprint[
        <xref ref-type="bibr" rid="ref2">2</xref>
        ]. It is crucial to monitor and
limit their environmental impact[
        <xref ref-type="bibr" rid="ref3">3</xref>
        ]. To facilitate users’ decision-making, a myriad of benchmarks is
available[
        <xref ref-type="bibr" rid="ref4">4, 5, 6, 7, 8, 9, 10</xref>
        ], providing a ranking of the LMs (i.e. leaderboard) based on scores computed
from the evaluation of multiple completions obtained from sending queries of distinct datasets (i.e.
benchmarks) to LM endpoints. The benchmarks are organized by specific domains: knowledge and
language understanding, reasoning capabilities, grounding and abstractive summarization, content
moderation and narrative control, coding capabilities, opinion and sentiment neutrality. Gathering the
results from several benchmarks belonging to the same domain leads to an overall comprehension of
LM performance in that domain.
      </p>
      <p>However, the ranking provided on standard leaderboards with public benchmarks may not be of interest
to all users. Benchmarks are run on third-party infrastructure, with static parameters, which may
not be those required by the business solutions. Latency may differ depending on the infrastructure,
performance may be impacted by the API call parameters, costs may vary depending on the company
pricing plans, etc. Companies will thus want to run their own benchmarks on their own data and
infrastructure to fully match their use case needs and implementations. More importantly, the volume
of benchmarks can reach tens of thousands of queries or more, meaning that their implementation and
processing are time-consuming and lead to unnecessary resource consumption. Efforts are required to
reduce time-to-value and align leaderboards with a responsible process.</p>
      <p>Efforts have been made in the community to reduce benchmarks, mostly relying on filtering the
datasets and selecting a subset based either on task or query similarities[11, 12]. They also involve
stratified random sampling on subset scenarios[13] or clustering[14], considering that some components
of the benchmark may be weighted less. Computation savings are also considered[13].</p>
      <p>In this paper, we developed a method that considers the whole set of queries equally, agnostically
suggesting a cutoff based on multiple randomized distributions of the queries, without filtering or
weights. We achieved performance values similar to the original benchmark. More importantly, we
did not only focus on the performance score, but also considered savings on carbon emissions, cost, and
duration, which are crucial considerations when developing AI applications. We experimentally tested
our methodology using commonly used datasets, with a common metric (accuracy) and considering
single-domain benchmarking, to ease understanding (Arc-C[15], GSM8k[16], Hellaswag[17], MMLU[18],
Winogrande[19], OpinionQA[20]). We used two popular models (gpt-3.5-turbo[21] and gpt-4o-mini[22]),
which were selected for their relatively low carbon emissions.</p>
    </sec>
    <sec id="sec-2">
      <title>2. Experimental Setup</title>
      <sec id="sec-2-1">
        <title>2.1. Workflow</title>
        <p>The workflow, presented in Figure 1, bootstraps over 1,000 iterations. Each iteration uses a randomized
distribution of the queries in the dataset to detect a cutoff, based on lowess smoothing of the absolute
values of successive slopes computed over windows including a small number of queries. Cutoffs are
then averaged, provided their values follow a normal distribution. The novelty of this framework is
that the subset filtered by the final cutoff is agnostic regarding query order, interdependency, or
importance.</p>
        <p>From a given dataset and a given LM, we evaluated the completions to compute the performance
score, used as a baseline, as given by accuracy (ratio of True over total). The next-token randomness and
probability parameters were alike for all datasets: temperature=0.7, top_p=0.95. The evaluation returned
a boolean indicating correctness of the completion against the reference answer, using a regex to postprocess
the completion. Moreover, verbosity of input and output, cost[23], and latency were collected. Eventually,
carbon emission equivalences were computed using the Ecologits Calculator[24] Expert Mode, giving
0.00092 gCO2eq per output token for gpt-3.5-turbo and 0.00054 gCO2eq for gpt-4o-mini, based on
location in France.</p>
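<p>The per-query bookkeeping described above can be sketched in Python. This is a hedged illustration, not the authors' code: the regex answer format and function names are assumptions, while the emission factors are the per-output-token values quoted in the text.</p>

```python
import re

# Per-output-token emission factors quoted in the text (gCO2eq per token).
EMISSION_G_PER_OUT_TOKEN = {
    "gpt-3.5-turbo": 0.00092,
    "gpt-4o-mini": 0.00054,
}

# Assumed answer format "(A) ..."; the paper only says a regex postprocess is used.
ANSWER_RE = re.compile(r"\(([A-D])\)")

def evaluate(completion: str, reference: str) -> bool:
    """Return True when the extracted choice key matches the reference answer."""
    m = ANSWER_RE.search(completion)
    return bool(m) and m.group(1) == reference

def benchmark(completions, references, out_tokens, model):
    """Accuracy (ratio of True over total) plus carbon estimate for one run."""
    correct = sum(evaluate(c, r) for c, r in zip(completions, references))
    accuracy = correct / len(references)
    carbon_g = sum(out_tokens) * EMISSION_G_PER_OUT_TOKEN[model]
    return accuracy, carbon_g
```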
      </sec>
      <sec id="sec-2-2">
        <title>2.2. Bootstrap of Performance Score Tracking</title>
        <p>We tracked the evolution of the performance score by computing the accuracy obtained at each
successive index of the datasets. The tracking curve eventually reached a plateau, ending at the
performance score used as the baseline.</p>
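<p>The tracking curve is simply the running accuracy after each successive query; a minimal sketch:</p>

```python
from itertools import accumulate

def track_accuracy(outcomes):
    """Running accuracy after each successive query (outcomes: booleans)."""
    return [c / (i + 1) for i, c in enumerate(accumulate(map(int, outcomes)))]
```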
        <p>Kneedle algorithm. The purpose was to identify the root of the plateau, from which the performance
stops evolving. Because the signal-like pattern of the tracking curve makes it impractical to detect the
plateau directly, we first transformed it to allow smoothing. We fractioned the tracking curve into multiple
frames of length n and computed the resulting slopes. We applied a lowess approximation on the
slope values and used the kneedle algorithm[25] with the appropriate curve pattern (e.g. ‘concave’ or
‘convex’) to find the knee (or elbow, respectively) as an optimal cutoff. The delta between the associated
performance at the cutoff and the baseline performance score represents the performance loss.</p>
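<p>A simplified, self-contained sketch of this step: a moving average stands in for lowess, and the knee finder is a minimal kneedle-style normalized-difference rule for a convex, decreasing curve (the study itself uses the kneedle algorithm of [25]):</p>

```python
def frame_slopes(curve, n):
    """Absolute average slope within each successive frame of length n."""
    return [abs(curve[i + n - 1] - curve[i]) / (n - 1)
            for i in range(0, len(curve) - n + 1, n)]

def smooth(ys, k=5):
    """Moving average standing in for lowess (a simplification)."""
    half = k // 2
    return [sum(ys[max(0, i - half): i + half + 1]) /
            len(ys[max(0, i - half): i + half + 1]) for i in range(len(ys))]

def knee_convex_decreasing(ys):
    """Minimal kneedle-style elbow for a convex, decreasing curve."""
    n = len(ys)
    lo, hi = min(ys), max(ys)
    if hi == lo:
        return 0
    x_norm = [i / (n - 1) for i in range(n)]
    y_norm = [(y - lo) / (hi - lo) for y in ys]
    # Flip y so the curve becomes concave increasing; the knee is then the
    # point furthest above the diagonal.
    diff = [(1 - y) - x for x, y in zip(x_norm, y_norm)]
    return max(range(n), key=diff.__getitem__)
```

On a decaying slope curve, the detected elbow marks where the slopes (and hence the score's evolution) have effectively flattened.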
        <p>Bootstrap. To ensure a negligible effect of query order, we first shuffled the index, then tracked the
performance score, logged both cutoff and performance loss, and repeated the process 1,000 times (see
Figure 1).</p>
        <p>Selected cutoff. Eventually, we verified the assumption of normal distribution of the cutoffs collected
over the 1,000 iterations with a two-sample Kolmogorov-Smirnov (KS) test against a normal distribution.
The selected cutoff was defined as the mean of the distribution. Again, the associated
performance score at the selected cutoff was compared to the baseline performance score to compute
the loss.</p>
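<p>The bootstrap-and-aggregate step can be sketched as follows. This is illustrative only: a one-sample KS statistic against a fitted normal is used in place of the paper's two-sample test, and detect_cutoff is a placeholder for the kneedle detection above.</p>

```python
import math
import random
import statistics

def normal_cdf(x, mu, sigma):
    return 0.5 * (1 + math.erf((x - mu) / (sigma * math.sqrt(2))))

def ks_stat_vs_normal(samples):
    """One-sample KS statistic against a normal fitted to the samples."""
    xs = sorted(samples)
    mu = statistics.fmean(xs)
    sigma = statistics.stdev(xs)
    n = len(xs)
    return max(max(abs((i + 1) / n - normal_cdf(x, mu, sigma)),
                   abs(i / n - normal_cdf(x, mu, sigma)))
               for i, x in enumerate(xs))

def select_cutoff(detect_cutoff, queries, iterations=1000, seed=0):
    """Shuffle query order, detect a cutoff each time, and return the mean
    cutoff plus the KS statistic of the cutoff distribution."""
    rng = random.Random(seed)
    cutoffs = []
    for _ in range(iterations):
        shuffled = queries[:]
        rng.shuffle(shuffled)
        cutoffs.append(detect_cutoff(shuffled))
    return statistics.fmean(cutoffs), ks_stat_vs_normal(cutoffs)
```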
        <p>Methodology validity. Since the method requires a frame of length n to compute slopes, we evaluated
the impact of n on the performance score induced by the selected cutof (Figure 2). We tested n both
as a static number, alike for all datasets, and as a relative proportion of the number of queries in the
dataset (dataset length). This allowed to select an optimal frame length to compute slopes, which would
a) allow to have enough points to compute slopes, b) impact a minimal loss on performance score, and
c) maximise the number of slopes.</p>
      </sec>
      <sec id="sec-2-3">
        <title>2.3. Cutoff Evaluation</title>
        <p>We assessed the robustness of the method by self-applying the cutoff to the datasets iteratively. This
allowed us to understand the limitations of the dataset volume on which our method can be used. From
the initial volume, we detected the cutoffs of the 1,000 bootstrap iterations, the associated deviation
from normal distribution with the KS p-value, and the selected cutoff (mean of the distribution). We cropped the
dataset to the selected cutoff and re-iterated the methodology, and so on, until we could reject that
the distribution of 1,000 cutoffs and the normal distribution were sampled from the same distribution
(p-value&lt;5e-02).</p>
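<p>Assuming the ~40.7% cutoff ratio reported in the Results as a fixed constant, this iterative self-application can be illustrated as follows. The paper's actual stopping rule is the KS p-value; a size floor of about 1,000 queries is used here as a crude proxy.</p>

```python
def iterate_cutoff(n_queries, ratio=0.407, min_robust=1000):
    """Crop the dataset to its cutoff repeatedly and return the successive
    dataset sizes, stopping once the next size would fall below the floor."""
    sizes = [n_queries]
    while round(sizes[-1] * ratio) >= min_robust:
        sizes.append(round(sizes[-1] * ratio))
    return sizes
```

Starting from Hellaswag's 49,947 queries, this toy iteration bottoms out near 1,370 queries, close to the 1,368 and 1,371 queries reported later in the paper.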
      </sec>
    </sec>
    <sec id="sec-3">
      <title>3. Results</title>
      <p>We applied the methodology using gpt-3.5-turbo (1106) and gpt-4o-mini (2024-07-18) with adapted
system prompts (see Table A1) to generate completions on the following datasets: Arc-C (2,590
grade-school level, multiple-choice science questions from the Challenge Set); GSM8k (8,792 grade-school
math problems); Hellaswag (49,947 common sense inference queries); MMLU (15,858 queries over
57 tasks like elementary mathematics, US history, computer science, law, etc., from different levels
like college, high school, etc.); Winogrande (40,490 common sense reasoning questions); OpinionQA
(6,024 multiple-choice queries challenging the opinion of the LM on social subjects). The baseline
performance scores are in Table 1.</p>
      <sec id="sec-3-1">
        <title>3.1. Frame Length</title>
        <p>The first step is to select a suitable frame length to compute the slopes and the lowess curve, and induce
the cutoff. Using a static number of queries in the frame does not allow us to conclude (Figure 2.a.).
Although the cutoff occurs around 40% of each dataset for small frames (up to 100), it drastically drops
(down to 1% with a 2,500-long frame for Arc-C) and is dependent on the number of queries (dataset
length). Indeed, while using 2,500 queries for Arc-C means computing the lowess curve on only 2 slope
values, it still allows 20 slopes with Hellaswag, and reaches a cutoff at 39.5% of the dataset. A similar
observation applies to the loss on performance score. While stable up to 100 queries in the frame (loss
from 0.002 on Hellaswag up to 0.008 on Arc-C), the loss rises up to 0.104 (variation over 0.769, hence a
13.5% error margin) with 2,500 queries on Arc-C.</p>
        <p>We used a frame length relative to the dataset length (Figure 2.b.) to obtain comparable outcomes
across datasets. We observed a drop for frames larger than 5% of the dataset length. The performance
loss is comparable between 1% and 5% (Arc-C: 0.008 with 1% to 0.01 with 5%, Hellaswag: 0.028 with 1%
to 0.002 with 5%), as is the percentage of the dataset concerned by the cutoff (Arc-C: 40.8% with
1% to 40.1% with 5%, Hellaswag: 40.7% with 1% to 39.8% with 5%).</p>
        <p>We thus selected 1% of the dataset length as a satisfying length for the frames.</p>
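<p>The frame-count arithmetic behind this choice can be made explicit. The ceiling rule below is an assumption that happens to reproduce the slope counts quoted above (2 for Arc-C and 20 for Hellaswag with a 2,500-long frame):</p>

```python
import math

def n_slopes(dataset_len, frame_len):
    """Number of slope frames the tracking curve yields (assumed ceiling rule)."""
    return math.ceil(dataset_len / frame_len)

def relative_frame(dataset_len, fraction=0.01):
    """Frame length as a fraction of the dataset length (1% selected here)."""
    return max(2, round(dataset_len * fraction))
```

With a relative 1% frame, every dataset yields on the order of 100 slopes, which is why the relative setting gives comparable outcomes across datasets.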
      </sec>
      <sec id="sec-3-2">
        <title>3.2. Application</title>
        <p>An example of the methodology outcomes is presented in Figure 3: the tracking of the performance
score along the queries (bottom), the tracking of the slopes (top, in blue, with n=1%), the lowess curve
(top, in red), and the cutoff (dashed green line). Collecting cutoffs for all 1,000 iterations allows us to
build the distribution (Figure A1), which we compared to a normal distribution (example in Figure 4
with Arc-C). The KS test conclusions were in favor of similarity between the normal and knee index
distributions (gpt-3.5-turbo on Arc-C: p=0.86; gpt-4o-mini: p=0.46, see Table A2). Hence, the mean of
the distribution can be used as a cutoff for each dataset.</p>
        <p>The baseline performance score, cutoff, loss on performance score, and volume induced by the
cutoff are depicted for each dataset in Table 1 (see Figure 5 for deviation from the initial performance
score). The method was reproducible across datasets, with a very consistent cutoff of 40.7% of the dataset
volume regardless of model and initial performance score. The performance loss was also consistent,
with a mean relative loss of 0.75 ± 0.53% (the absolute loss average is 0.43 ± 0.23% for gpt-3.5-turbo, 0.38 ±
0.18% for gpt-4o-mini).</p>
      </sec>
      <sec id="sec-3-3">
        <title>3.3. Ecological and Economical Impact</title>
        <p>Given the current prices for inference, the compromise between cost reduction and potential loss on
performance is indisputable (Table 2). Moreover, on top of the 26 h saved to process the benchmarks
for both models (8 h 42 min instead of 21 h 24 min for gpt-3.5-turbo, 8 h 55 min instead of 22 h 18 min
for gpt-4o-mini), the amount of carbon emissions saved is substantial: 1.818 kgCO2eq (770.9 g instead
of 1898.1 g for gpt-3.5-turbo, 464.7 g instead of 1155.3 g for gpt-4o-mini), reaching over 59% reduction.</p>
        <p>Figure 2: (a) Impact of frame length on the performance loss (as a fixed number of items in the frame for
all datasets). (b) Impact of frame length on the performance loss (as a number of items in the frame dependent
on each dataset length).</p>
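<p>A quick check that the headline figures are mutually consistent (all inputs are the durations and emissions quoted above):</p>

```python
def minutes(h, m):
    """Convert hours and minutes to total minutes."""
    return 60 * h + m

# Time saved: baseline minus reduced runs, for both models.
saved_min = (minutes(21, 24) - minutes(8, 42)) + (minutes(22, 18) - minutes(8, 55))
# Carbon saved (grams): baseline minus reduced runs, for both models.
saved_g = (1898.1 - 770.9) + (1155.3 - 464.7)
# Relative reduction against the combined baseline emissions.
reduction = saved_g / (1898.1 + 1155.3)
```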
      </sec>
      <sec id="sec-3-4">
        <title>3.4. Limitations</title>
        <p>Applying the cutoff on the same dataset iteratively ends up with very consistent results. The cutoff
occurs around 40% at each iteration, regardless of the dataset size (Figure 6). However, the KS p-value
reaches significance for datasets smaller than 101 queries for gpt-3.5-turbo (e.g. GSM8k, p-value=4.9e-02)
or 233 for gpt-4o-mini (e.g. Hellaswag, p-value=3.9e-03). Considering the iteration before reaching
significance, we can observe that the relative loss in performance ranges from 3.2 to 7.4% for
gpt-3.5-turbo, and from 1.1 to 11.0% for gpt-4o-mini. These observations help in understanding how many
queries are required to provide a reliable performance score.</p>
      </sec>
    </sec>
    <sec id="sec-4">
      <title>4. Discussion</title>
      <sec id="sec-4-1">
        <title>4.1. Opportunities for savings</title>
        <p>To drastically crop the datasets while obtaining similar performance results, we used a reduction methodology on the
tracking of the performance scores, for several datasets with completions generated using two popular
language models. The adoption of a sustainable and responsible approach, assessing carbon emissions, cost,
and latency along with performance, is a key contribution of our study. Moreover, our approach considers
all queries in the datasets without subset filtering, with each query having equal weight. An additional
feature could be that, for multi-domain datasets, a cutoff would need to follow constrained shuffling to
respect domain proportions in the original dataset. The methodology would benefit from reproducing
the results on additional datasets. However, the consistent results presented here show that the
method does not depend on the dataset. We also adapted the methodology to the TruthfulQA[26]
dataset (see Figure A2), which relies on an NLP metric (e.g. BLEURT[27]) to evaluate completions over
817 open questions. The method required a normalization of the metric to adapt the BLEURT score
range (from a -1.81 to 1.13 range to 0-100). We obtained a cutoff at 40.7% of the dataset (333 queries) for
a 1.06% loss (0.654 point difference from the final score of 61.84). Moreover, we used models that are
less costly, with a certain pattern of carbon emission (0.9 and 0.5 mg per token). Comparing the latest
LRMs side by side with these models would be interesting. For example, gpt-4o-mini-2024-07-18 is only 1% the
cost of o1-2024-12-17, and 6.1% of o1 emissions per token (8.85 mgCO2eq per token in France). Using
non-public datasets (data not shown), we estimated that gpt-4o-mini verbosity was 12% of o1 in low
reasoning mode (when o1 uses 100 tokens to reason and generate a completion, gpt-4o-mini uses 12
tokens). Given that, in the frame of the current study, gpt-4o-mini would generate 2,135,513 tokens, the
o1 equivalence would be 17,795,941 tokens, reaching a value of 157.5 kgCO2eq, and a total cost of USD
318.5 for input and USD 1174.5 for output. Using our methodology would save up to 93.4 kgCO2eq
and USD 885.3 when benchmarking o1 on these datasets. Scaling the mindset of this paper up to the whole
landscape of models and leaderboard providers (possibly hundreds of models on several platforms[28]),
one could achieve savings of tons of CO2eq and thousands of dollars.</p>
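<p>The o1 extrapolation above can be verified arithmetically (all inputs are the figures quoted in this paragraph):</p>

```python
# gpt-4o-mini verbosity is estimated at 12% of o1's; o1 emits 8.85 mgCO2eq
# per token (France); the reduced benchmark keeps 40.7% of the queries.
GPT4O_MINI_TOKENS = 2_135_513
o1_tokens = GPT4O_MINI_TOKENS * 100 / 12   # o1 token equivalence
o1_kg = o1_tokens * 8.85e-3 / 1000         # mg/token to g, then g to kg
saved_kg = o1_kg * (1 - 0.407)             # emissions avoided by the cutoff
```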
      </sec>
      <sec id="sec-4-2">
        <title>4.2. Datasets size limitations</title>
        <p>We studied large datasets for which volume reduction makes sense. However, we tested the limits of
the methodology by self-applying the cutoff strategy iteratively to the same dataset. As expected, at
some point, a small dataset is irrelevant since the performance loss is substantial, and the distribution
of the detected cutoff for the 1,000 iterations deviates from a normal distribution. Even before reaching
a significantly non-normal distribution, the performance loss exceeds 5% for some datasets (GSM8k,
OpinionQA with gpt-3.5-turbo). While the trade-off between performance loss, resource savings, and
methodology robustness (normal approximation) depends on the AI task and its applications, our
results roughly indicate that a dataset volume of about 1,000 queries is the minimal amount needed for
excellent robustness with an acceptable performance loss of 1-3%. For the largest dataset (Hellaswag: 49,947),
respectively with gpt-3.5-turbo and gpt-4o-mini, we could iteratively reach 1,368 and 1,371 queries
while keeping 1.1% and 1.2% loss on initial performance, thus using only 2.7% of the dataset. Instead of
reaching 78.7 and 79.1% as initially obtained with 49,947 queries, we could achieve 77.8–79.6% and
78.1–80.0%.</p>
      </sec>
      <sec id="sec-4-3">
        <title>4.3. Importance of parameters optimization</title>
        <p>Although such a delta between scores may not make the difference in a leaderboard for models with similar
performance, one must remember that tuning parameters and the system prompt remain among the important
factors to enhance performance, without mentioning in-context learning or fine-tuning techniques.
While models may be tied on leaderboard scores, tuning parameters, prompt, or injected knowledge will
eventually have a greater impact on their performance. We chose a static set of parameters for the sake
of comparison, and simple system prompts to minimize complexity; the performance values presented
here should therefore not be considered definitive. AI developers must adapt tuning to their own cases.</p>
      </sec>
    </sec>
    <sec id="sec-5">
      <title>5. Conclusion</title>
      <p>We developed and tested a robust method to assess volume reduction for popular datasets used in
public benchmarks and leaderboards. We drastically reduced the number of queries in these
datasets with negligible performance loss, inducing considerable cost, duration, and carbon footprint
savings. Overall, using only 40.7% of the datasets saves 26 h of inference computation and 1.8
kgCO2eq (&gt;59% reduction), for a 0.75% relative loss on performance score. For the largest dataset in this
study, we achieved a 1.1% loss on performance with only 2.7% of the dataset.</p>
      <p>Our study relies on datasets possibly reaching saturation with the latest LMs, hence future work should
include more recent and complex datasets, with several metrics and domains. Moreover, we used
two models that are believed to be popular and not among the most resource-consuming. The study
will benefit from the further inclusion of resource-consuming models like LRMs. Eventually, a method
that allows for a real-time evaluation of the cutoff, where adding more queries would not help much, will be
beneficial to the community.</p>
      <p>Placing these numbers back in the whole leaderboard landscape, where dozens to hundreds of models
are benchmarked, the amount of saved CO2eq could reach the order of tons and the cost
savings thousands of dollars. We also achieved significant progress towards understanding the
optimal number of queries. Too little data means too much variability and less robustness, while too
many queries mean a waste of resources and a non-sustainable mindset.</p>
      <p>In an era where everything seems possible with AI but overall resources are limited, this mindset should be applied when
developing and evaluating industrial AI applications. Benchmarking LMs is the right approach to select
the most appropriate and most sustainable model, and using the most appropriate number of queries is
the first step towards sustainability.</p>
    </sec>
    <sec id="sec-6">
      <title>Acknowledgments</title>
      <p>This work was supported by the AI Hub of Schneider Electric. We would like to thank Julia Peyre, Head
of Strategy and Innovation, Schneider Electric, and Claude Le Pape, Fellow Data Scientist, Schneider
Electric, for their support and review of the current work.</p>
    </sec>
    <sec id="sec-7">
      <title>Declaration on Generative AI</title>
      <p>The authors have not employed any Generative AI tools in the writing, review, and editing of this document.</p>
      <p>[5] W.-L. Chiang, L. Zheng, Y. Sheng, et al., Chatbot arena: An open platform for evaluating llms by human preference, 2024. URL: https://arxiv.org/abs/2403.04132. arXiv:2403.04132.
[6] X. Li, T. Zhang, Y. Dubois, et al., Alpacaeval: An automatic evaluator of instruction-following models, 2023. URL: https://github.com/tatsu-lab/alpaca_eval.
[7] S. J. Paech, Eq-bench: An emotional intelligence benchmark for large language models, 2024. URL: https://arxiv.org/abs/2312.06281. arXiv:2312.06281.
[8] F. Yan, H. Mao, C. C.-J. Ji, et al., Berkeley function calling leaderboard, 2024. URL: https://gorilla.cs.berkeley.edu/blogs/8_berkeley_function_calling_leaderboard.html.
[9] V. Lai, N. T. Ngo, A. P. B. Veyseh, F. Dernoncourt, T. H. Nguyen, Open multilingual llm evaluation leaderboard, 2023.
[10] N. Muennighoff, N. Tazi, L. Magne, N. Reimers, Mteb: Massive text embedding benchmark, 2023. URL: https://arxiv.org/abs/2210.07316. arXiv:2210.07316.
[11] H. Zhao, M. Li, L. Sun, T. Zhou, Bento: Benchmark task reduction with in-context transferability, 2024. URL: https://arxiv.org/abs/2410.13804. arXiv:2410.13804.
[12] F. M. Polo, L. Weber, L. Choshen, Y. Sun, G. Xu, M. Yurochkin, tinybenchmarks: evaluating llms with fewer examples, 2024. URL: https://arxiv.org/abs/2402.14992. arXiv:2402.14992.
[13] Y. Perlitz, E. Bandel, A. Gera, et al., Efficient benchmarking of language models, 2024. URL: https://arxiv.org/abs/2308.11696. arXiv:2308.11696.
[14] R. Vivek, K. Ethayarajh, D. Yang, D. Kiela, Anchor points: Benchmarking models with much fewer examples, 2024. URL: https://arxiv.org/abs/2309.08638. arXiv:2309.08638.
[15] P. Clark, I. Cowhey, O. Etzioni, et al., Think you have solved question answering? try arc, the ai2 reasoning challenge, 2018. URL: https://arxiv.org/abs/1803.05457. arXiv:1803.05457.
[16] K. Cobbe, V. Kosaraju, M. Bavarian, et al., Training verifiers to solve math word problems, 2021. URL: https://arxiv.org/abs/2110.14168. arXiv:2110.14168.
[17] R. Zellers, A. Holtzman, Y. Bisk, A. Farhadi, Y. Choi, Hellaswag: Can a machine really finish your sentence?, 2019. URL: https://arxiv.org/abs/1905.07830. arXiv:1905.07830.
[18] D. Hendrycks, C. Burns, S. Basart, et al., Measuring massive multitask language understanding, 2021. URL: https://arxiv.org/abs/2009.03300. arXiv:2009.03300.
[19] K. Sakaguchi, R. L. Bras, C. Bhagavatula, Y. Choi, Winogrande: An adversarial winograd schema challenge at scale, 2019. URL: https://arxiv.org/abs/1907.10641. arXiv:1907.10641.
[20] S. Santurkar, E. Durmus, F. Ladhak, C. Lee, P. Liang, T. Hashimoto, Whose opinions do language models reflect?, 2023. URL: https://arxiv.org/abs/2303.17548. arXiv:2303.17548.
[21] T. B. Brown, B. Mann, N. Ryder, et al., Language models are few-shot learners, 2020. URL: https://arxiv.org/abs/2005.14165. arXiv:2005.14165.
[22] S. Shahriar, B. Lund, N. R. Mannuru, et al., Putting gpt-4o to the sword: A comprehensive evaluation of language, vision, speech, and multimodal proficiency, 2024. URL: https://arxiv.org/abs/2407.09519. arXiv:2407.09519.
[23] Microsoft, Azure openai service pricing, 2025. URL: https://azure.microsoft.com/en-us/pricing/details/cognitive-services/openai-service/.
[24] S. Rincé, A. Banse, V. Defour, Ecologits calculator, 2025. URL: https://huggingface.co/spaces/genai-impact/ecologits-calculator/.
[25] V. Satopaa, J. Albrecht, D. Irwin, B. Raghavan, Finding a 'kneedle' in a haystack: Detecting knee points in system behavior, 2011. doi:10.1109/ICDCSW.2011.20.
[26] S. Lin, J. Hilton, O. Evans, Truthfulqa: Measuring how models mimic human falsehoods, 2022. URL: https://arxiv.org/abs/2109.07958. arXiv:2109.07958.
[27] T. Sellam, D. Das, A. P. Parikh, Bleurt: Learning robust metrics for text generation, in: Proceedings of ACL, 2020.
[28] Artificial Analysis, Llm leaderboard, 2025. URL: https://artificialanalysis.ai/leaderboards/models/.</p>
    </sec>
    <sec id="sec-8">
      <title>A. Supplementary Tables</title>
      <p>Table A.1: System prompts. Each system prompt starts with "You are a useful assistant." and ends with "Provide
your answer with the choice key in parenthesis followed by the choice sentence. Do not provide explanation
for your choice.". The dataset-specific sentences in between are:</p>
      <p>Arc-C: You will be told a context followed by a question, and multiple choices.</p>
      <p>GSM8k: You will be told a grade-school level mathematical problem.</p>
      <p>Hellaswag: You will be told a context followed by a question, and multiple choices.</p>
      <p>MMLU: You will be told one or two contexts (scenarios, problem, social statement), followed by multiple choices.</p>
      <p>Winogrande: You will be told a common sense context followed by a sentence with a missing word, and multiple choices.</p>
      <p>Opinionqa: You will be told a context followed by a question on political, socio-economic aspects, and multiple choices. The last choice ’Refused’ is the one to choose if you judge the other choices not appropriate.</p>
      <p>Table A.2: Kolmogorov-Smirnov test p-values confronting the normal distribution to the cutoff distribution over
1,000 iterations, for gpt-3.5-turbo and gpt-4o-mini on each dataset.</p>
      <p>Figure A.1: Distribution of the indices detected at each iteration of the kneedle method for each dataset (one dataset
per line) for gpt-3.5-turbo (left column) and gpt-4o-mini (right column). The orange dashed line represents the
selected cutoff (mean).</p>
      <p>Figure A.2: The tracking of the slopes (top, blue dots) for one iteration of randomizing the TruthfulQA queries, computed
over each successive 8-query frame (1% of 817 queries), with a lowess smoothing trendline (red). Tracking of the
performance score (bottom) along the same queries, reaching the final performance score (61.84) at the end. The
kneedle algorithm identifies the plateau root (dashed green line) at the 366th query.</p>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>
          [1]
          <string-name>
            <given-names>S.</given-names>
            <surname>Minaee</surname>
          </string-name>
          ,
          <string-name>
            <given-names>T.</given-names>
            <surname>Mikolov</surname>
          </string-name>
          ,
          <string-name>
            <given-names>N.</given-names>
            <surname>Nikzad</surname>
          </string-name>
          , et al.,
          <source>Large language models: A survey</source>
          ,
          <year>2025</year>
          . URL: https://arxiv.org/abs/2402.06196. arXiv:2402.06196.
        </mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>
          [2]
          <string-name>
            <given-names>S.</given-names>
            <surname>Luccioni</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Y.</given-names>
            <surname>Jernite</surname>
          </string-name>
          , E. Strubell,
          <article-title>Power hungry processing: Watts driving the cost of ai deployment?</article-title>
          ,
          <source>in: Proceedings of the 2024 ACM Conference on Fairness, Accountability, and Transparency</source>
          , FAccT '24, Association for Computing Machinery, New York, NY, USA,
          <year>2024</year>
          , p.
          <fpage>85</fpage>
          -
          <lpage>99</lpage>
          . URL: https://doi.org/10.1145/3630106.3658542. doi:10.1145/3630106.3658542.
        </mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>
          [3]
          <string-name>
            <given-names>S.</given-names>
            <surname>Luccioni</surname>
          </string-name>
          ,
          <string-name>
            <given-names>B.</given-names>
            <surname>Gamazaychikov</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Hooker</surname>
          </string-name>
          , et al.,
          <article-title>Light bulbs have energy ratings - so why can't ai chatbots?</article-title>
          ,
          <source>Nature</source>
          <volume>632</volume>
          (
          <year>2024</year>
          )
          <fpage>736</fpage>
          -
          <lpage>38</lpage>
          . doi:10.1038/d41586-024-02680-3.
        </mixed-citation>
      </ref>
      <ref id="ref4">
        <mixed-citation>
          [4]
          <string-name>
            <given-names>C.</given-names>
            <surname>Fourrier</surname>
          </string-name>
          ,
          <string-name>
            <given-names>N.</given-names>
            <surname>Habib</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Lozovskaya</surname>
          </string-name>
          ,
          <string-name>
            <given-names>K.</given-names>
            <surname>Szafer</surname>
          </string-name>
          , T. Wolf, Open llm leaderboard v2, https://huggingface.co/spaces/open-llm-leaderboard/open_llm_leaderboard,
          <year>2024</year>
          .
        </mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>