<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta>
      <journal-title-group>
        <journal-title>Workshop on AI Evaluation Beyond Metrics (July 2022)</journal-title>
      </journal-title-group>
    </journal-meta>
    <article-meta>
      <article-id pub-id-type="doi">10.1145/3495256</article-id>
      <title-group>
        <article-title>Run: Small Assessors Anticipate Big Language Models</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Wout Schellaert</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
          <xref ref-type="aff" rid="aff2">2</xref>
          <xref ref-type="aff" rid="aff3">3</xref>
          <xref ref-type="aff" rid="aff4">4</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Lexin Zhou</string-name>
          <contrib-id contrib-id-type="orcid">0000-0003-1161-4270</contrib-id>
          <email>lzhou@inf.upv.es</email>
          <xref ref-type="aff" rid="aff0">0</xref>
          <xref ref-type="aff" rid="aff2">2</xref>
          <xref ref-type="aff" rid="aff3">3</xref>
          <xref ref-type="aff" rid="aff4">4</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Fernando Martínez-Plumed</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
          <xref ref-type="aff" rid="aff2">2</xref>
          <xref ref-type="aff" rid="aff3">3</xref>
          <xref ref-type="aff" rid="aff4">4</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>José Hernández-Orallo</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
          <xref ref-type="aff" rid="aff2">2</xref>
          <xref ref-type="aff" rid="aff3">3</xref>
          <xref ref-type="aff" rid="aff4">4</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Cèsar Ferri</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
          <xref ref-type="aff" rid="aff2">2</xref>
          <xref ref-type="aff" rid="aff3">3</xref>
          <xref ref-type="aff" rid="aff4">4</xref>
        </contrib>
        <kwd-group>
          <kwd>Assessor</kwd>
          <kwd>Anticipative Reject Option</kwd>
          <kwd>Language Model</kwd>
          <kwd>Data Wrangling</kwd>
          <kwd>AI Evaluation</kwd>
          <kwd>Instance Granularity</kwd>
        </kwd-group>
        <aff id="aff0">
          <label>0</label>
        </aff>
        <aff id="aff1">
          <label>1</label>
          <institution>Leverhulme Centre for the Future of Intelligence, University of Cambridge</institution>
        </aff>
        <aff id="aff2">
          <label>2</label>
          <institution>Valencian Research Institute for Artificial Intelligence (VRAIN), Universitat Politècnica de València</institution>
        </aff>
        <aff id="aff3">
          <label>3</label>
          <institution>Workshop Proceedings</institution>
        </aff>
        <aff id="aff4">
          <label>4</label>
        </aff>
      </contrib-group>
      <pub-date>
        <year>2022</year>
      </pub-date>
      <volume>25</volume>
      <issue>2022</issue>
      <fpage>36</fpage>
      <lpage>48</lpage>
      <abstract>
        <p>Large Language Models (LMs) are expensive to operate. It would be more frugal to avoid querying them when results are predictably bad. In this paper we therefore investigate whether it is possible to granularly predict the performance of these large LMs with a much smaller external model, the assessor, which is trained on evaluation results. For instance, given an input prompt, can an assessor estimate the probability of correct completion by a giant like GPT-3 Davinci (175B parameters)? Using a data-wrangling task included in the BIG-bench repository as a case study, we find it is indeed possible, and we report results that are comparable in accuracy and calibration to the LM itself. This suggests that, at least for some tasks, a lot of compute, money, and emissions could be spared through the assessor's anticipative reject option. It also suggests that assessors can capture meaningful extra information from the evaluation procedure, and as such, could be a useful complement to simple aggregate metrics.</p>
      </abstract>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>1. Introduction</title>
      <p>Extensive experimental research on Language Models (LMs) keeps showing remarkable results across several domains including mathematics, question answering, language understanding, and code generation [1, 2, 3, 4, 5, 6, 7, 8, 9]. While the performance results for many tasks are quickly improving –on average– there is a high variance in the results depending on the particular task, the instances, and the prompts [10]. For a given task, one can partially deal with the variability across instances by only using the model’s decision when the probability of its answer (i.e., its “confidence”) does not fall below a certain threshold [11, 12, 13]. This requires a good calibration of the model. However, even if LMs were well calibrated, and they are generally not [10, 14], it would also still require actually running the inference. For large LMs this comes at a non-negligible cost per token. For this, we need an assessor: an external conditional probability estimator that predicts the performance of the LM on an instance (or a collection of instances), without running it through the LM at all. This enables the calculation of whether it is actually worth asking the LM for an answer, depending on factors such as the value of a correct result, the cost of running the model, and of course the performance estimated by the assessor. See Figure 1 for an illustrative example.</p>
      <p>Figure 1: (Top) Process of a LM generating the solution for a date transformation (repetitive) problem in a spreadsheet. Once the user prompts one instance of the desired transformation (row 1), the LM proceeds to transform the rest of the instances (rows 2 &amp; onward). (Bottom) Process of an assessor that can reliably predict beforehand the performance of the LM at the instance level.</p>
      <p>This paper describes a (successful) attempt at building such an assessor for a collection of large LMs consisting of various scales of GPT-3 [3] and BIG-G [10] models, for application to a diverse set of data-wrangling tasks. Data wrangling [16, 17] is a notoriously time-consuming data preparation chore where LMs have recently shown promising results [18]. We discuss data wrangling and the considerations regarding the use of LMs in sections 2.1 and 2.2.</p>
      <sec id="sec-1-1">
        <title>Contributions</title>
        <p>To our knowledge, this is the first paper analysing assessors applied to the language domain, and to a semi-plausible use case in general. Additionally:</p>
        <list list-type="bullet">
          <list-item><p>We find that lightweight assessors can give reliable instance-level predictions of the performance of large LMs.</p></list-item>
          <list-item><p>We find that their predictions are well-calibrated and unbiased, again comparable to the self-assessment of the LMs.</p></list-item>
          <list-item><p>We investigate the contributions of various features like #shots and #parameters to assessor performance.</p></list-item>
        </list>
      </sec>
    </sec>
    <sec id="sec-1-2">
      <title>2. Background</title>
      <p>In this section we revisit some key ideas of LMs, their costs, their applications to the data-wrangling problem, and the traditional (post-hoc) reject option. We also summarise the main elements of the recently introduced concept of assessor models.</p>
      <sec id="sec-1-2-1">
        <title>2.1. (Large) Language Models</title>
        <p>In less than a decade, research in Natural Language Processing (NLP) has been overturned by the appearance of a suite of LMs trained in an unsupervised manner on very large corpora. LMs are capturing more and more of the information in natural language, including the linguistic characteristics of various human languages and associated knowledge. Moreover, these models can be adapted (e.g., through fine-tuning) to a wide range of downstream tasks [8]. Recent LMs such as GPT-3 [3], PanGu [19], GLaM [20] and OPT [21] have excelled at few-shot inference, where a task is solved by supplying a small set of correct examples formatted as a prompt. The quality of the completion usually depends on the number of supplied examples. For instance, 5-shot inference is usually better than 2-shot inference, but requires more effort from the user.</p>
        <p>However, on many occasions the cost of running LMs is not negligible in both computational [22, 23] and economic terms [24]. Large LMs, open source or not, all have steep development costs in common. A recent study [24] puts the cost of developing a LM with only 1.5 billion parameters at $1.6 million. Inference cost is another drain: [25] estimates the cost of running GPT-3, if run in the cloud, at a minimum of $87,000 per year, with the current API price for Davinci being 6 cents per 750 words (https://openai.com/api/pricing/). Of course, these costs go down quickly as compute becomes cheaper, but larger models are expected to replace the old ones quickly to set the new state of the art. Also, as LMs increase their performance, their penetration rates will increase, becoming widespread in billions of automated operations in many domains, and compute might easily become more of an issue, not less.</p>
      </sec>
      <sec id="sec-1-2-2">
        <title>2.2. Data Wrangling</title>
        <p>Data wrangling [16, 17] is a data preparation task that data janitors, data scientists and other people operating with forms, spreadsheets and other data formatting situations consider a very monotonous and laborious part of their jobs. Data wrangling can require as much as 80 percent of their time [26], including tediously transforming data presented in heterogeneous formats into a standardised format for efficient access, understanding, and analysis. One of the challenges in data-wrangling automation consists of selecting the correct (string) transformations from the vast set of possible ones, and doing so by only having seen a few examples [27]. Many approaches have attempted to address this challenge by reducing the transformation space through the incorporation of prior knowledge [28, 29]. This led to many tools that use domain-specific languages or need ad hoc solutions [30].</p>
        <p>Because LMs capture vast amounts of human knowledge across many different domains, they can be specially effective for more open-ended tasks, and as such data wrangling is recognised in data science automation [31]. Using few-shot inference [3, 32, 33], LMs have shown promising yet unreliable results for data wrangling. For instance, in [18] GPT-3 Davinci (prompted, not finetuned) achieves a 56% accuracy in the 1-shot setting, 68% with the 4-shot setting, and almost 90% with 10 shots. Additionally, as opposed to LM results on other tasks, GPT-3 is also relatively well calibrated in the data-wrangling task, reporting a Brier score of 0.11 (see section 4).</p>
      </sec>
    </sec>
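      <p>As a concrete illustration of the few-shot inference described above, the following sketch builds an n-shot prompt for a date-normalisation task in the style of the Figure 1 example. The "Input:"/"Output:" record format and the build_prompt helper are illustrative assumptions, not the exact prompt template used in the paper.</p>

```python
# Illustrative sketch (assumed format, not the paper's exact template):
# build an n-shot prompt for a date-normalisation data-wrangling task.
# Each "shot" is one solved input -> output pair; the final line leaves
# the output blank for the LM to complete.

def build_prompt(examples, query):
    """examples: list of (raw, normalised) pairs; query: raw string to transform."""
    lines = []
    for raw, normalised in examples:
        lines.append(f"Input: {raw}")
        lines.append(f"Output: {normalised}")
    lines.append(f"Input: {query}")
    lines.append("Output:")  # completion point for the LM
    return "\n".join(lines)

# A 2-shot prompt, using date strings like those in Figure 1
shots = [
    ("Tue Mar 14 19:09:37 CDT 2021", "14/03/21 07:09:37 PM"),
    ("Jul 2nd, 2019 13:37:37 (EDT)", "02/07/19 01:37:37 PM"),
]
prompt = build_prompt(shots, "June 1, 2020 CMT 18:07:26")
```

      <p>More shots generally improve accuracy, at the price of a longer prompt and more user effort, which is exactly the cost trade-off the assessor is meant to arbitrate.</p>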
    <sec id="sec-2a">
      <title>2.3. Reject option</title>
      <p>Given these unreliable accuracies but good calibration scores, we could have a more reliable and effective use of these systems by not using those outputs for which the confidence of the system is low. In other words, if we know for which instances the LM is (likely to be) wrong, we can abstain from using the output of the LM in these cases. This is called a ‘reject option’, and a classic and straightforward implementation for it is to use a confidence threshold t and compare it with the probability p̂(y|x) that a model assigns to its output ŷ. This represents the self-assigned probability of being correct (i.e., its confidence) [11, 12, 13]. We set t to match the error tolerance of the use case, and when p̂(y|x) falls below t, we do not trust the output of the model, usually delegating to a human. However, this classical interpretation of the reject rule still requires running the model. As mentioned before, this can be expensive for large LMs. Whenever the reject rule triggers, it is not only that humans need to do the task manually, but we have also incurred a cost in the computation of a model that is effectively wasted.</p>
    </sec>
    <sec id="sec-2b">
      <title>2.4. Assessors</title>
      <p>Assessor models [15] provide an external, anticipative reject option instead. Assessors are conditional probability (or density) estimators r̂(s|π, i) that are trained on evaluation data. With ‘evaluation data’ we mean a set of evaluation records ⟨π, i, s⟩, where π refers to a profile or description of a particular system (e.g., deployment conditions, state, system architecture, or hyperparameters), i refers to a particular instance (e.g., a prompt), and s to an empirical measurement of the performance of π on i. Assessor models are meant to act as general mappings between the space of systems, the space of instances, and the corresponding distribution of scores. They are a way of capturing all available evaluation information in a single predictive model that could be used, e.g., to investigate what features make an instance difficult, to add confidence capabilities to systems that do not have them, or to select the optimal model for a specific instance. In this case, we focus on their use to provide an anticipative reject option: when r̂ is built and shown to be an accurate estimator, we can use it to make inferences about the expected performance r̂(s = 1|π, i) given a system and instance (or a collection of those). We do still have to run actual inference on the assessor, but as we show in the experiments, assessors can be multiple orders of magnitude smaller than the LMs, allowing us to cheaply avoid any LM inference that is doomed to fail.</p>
    </sec>
    <sec id="sec-2">
      <title>3. Methods</title>
      <p>In this section we identify the experimental setting, including the goals of the analysis, the data sources and how they are converted into evaluation records, and how we build the assessor from them.</p>
      <sec id="sec-2-0-1">
        <title>3.1. Experimental Questions</title>
        <p>We set three experimental questions:</p>
        <list list-type="simple">
          <list-item><p>Q1: Can we build lightweight yet good assessors for language models in this domain?</p></list-item>
          <list-item><p>Q2: Are the assessors of comparable quality to the language models when estimating probabilities?</p></list-item>
          <list-item><p>Q3: What features from the systems and the instances are most relevant for predicting success and consequently for building good assessors?</p></list-item>
        </list>
      </sec>
      <sec id="sec-2-0-2">
        <title>3.2. Data Sources and Train-Test Split</title>
        <p>We work with the Data Wrangling Dataset Repository, containing 119 tasks from 7 domains (dates, emails, free text, names, phones, times, and units). In particular, we use results (at the instance level) from multiple LMs obtained from two different evaluation efforts. First, [34] have produced granular results of the evaluation of different versions of GPT-3. We have 146k instances available for the GPT-3 models Ada (350M), Babbage (1.3B), Curie (6.7B), and Davinci (175B), from 0-shot to 10-shot. More information about the architectures can be found in [3]. Second, [10] provides results on the same benchmark for a collection of Google LMs of various parameter sizes. Here we extract 86k instances, from 0-shot to 3-shot, for 22 models with parameter sizes ranging from 2M to 128B across two different model families, a decoder-only dense transformer (BIG-G dense) and a sparse Mixture-of-Experts [35] model (BIG-G sparse). More information on the BIG-G network architectures is available in [10]. All models (GPT-3 and BIG-G variants) were queried with temperature set to 0, and none of them were fine-tuned for the data-wrangling task.</p>
        <p>As the assessor is trained on a somewhat heterogeneous collection of systems and instances, we have to be careful to define a train-test partition of the evaluation results without contamination or information leakage. To this purpose, we must ensure that the same instances are consistently used across systems and shots. For example, we have to avoid that the result of BIG-G dense with 2 shots on instance i is in the training set, while GPT-3 Ada’s result with 0 shots on the same instance is in the test set. Figure 2 shows a visual representation of the partition requirements that ensure that this does not happen. The order-matched train-test partition leads to 194k training instances and 38k testing instances.</p>
      </sec>
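      <p>The partition requirement above can be sketched as a grouped split: all records sharing an instance id go to the same side, regardless of system or #shots. The record fields and the grouped_split helper below are hypothetical names, a minimal stand-in for the order-matched procedure rather than its actual implementation.</p>

```python
# Minimal sketch of a leakage-free partition: every record for the same
# instance id (across systems and #shots) lands on the same side of the
# train/test split. Record fields here are hypothetical.
import random

def grouped_split(records, test_fraction=0.2, seed=0):
    """records: list of dicts with keys 'system', 'shots', 'instance', 'score'."""
    instance_ids = sorted({r["instance"] for r in records})
    rng = random.Random(seed)
    rng.shuffle(instance_ids)
    n_test = int(len(instance_ids) * test_fraction)
    test_ids = set(instance_ids[:n_test])
    train = [r for r in records if r["instance"] not in test_ids]
    test = [r for r in records if r["instance"] in test_ids]
    return train, test

# Toy evaluation records: 2 systems x 2 shot settings x 10 instances
records = [
    {"system": s, "shots": k, "instance": i, "score": 1}
    for s in ("gpt3-ada", "bigg-dense-2m") for k in (0, 2) for i in range(10)
]
train, test = grouped_split(records)
# No instance id appears on both sides of the split:
assert {r["instance"] for r in train}.isdisjoint({r["instance"] for r in test})
```
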
      <sec id="sec-2-1">
        <title>3.3. Anatomy of the Evaluation Record</title>
        <p>For the data-wrangling tasks, all scores are binary: 1 if the output of the LM matches the target string exactly, and 0 otherwise. The score is what the assessors must predict, and thus acts as a label during training.</p>
        <p>From our two data sources, we receive records of the shape ⟨system id, #shots, prompt, score⟩. We further annotate this record with features describing the system (π), and extract metafeatures of the instance (i) that are fit for tabular representation (as opposed to free-form text). In the end, this creates a general record of the shape ⟨π, i, s⟩ = ⟨⟨system features⟩, ⟨instance features⟩, score⟩. We describe these features in detail below, but ultimately the only constraint for making a useful assessor is that all system and instance features are available without actually running the original model.</p>
        <sec id="sec-2-1-1">
          <title>3.3.1. System features</title>
          <p>The available system features include a system id that refers to a specific trained LM, i.e., a set of learned parameters fitting a certain architecture, the id of that architecture (either GPT-3, BIG-G sparse, or BIG-G dense), whether a model is dense or sparse, and the number of parameters. These features will of course be the same for all records of the same trained model.</p>
        </sec>
        <sec id="sec-2-1-2">
          <title>3.3.2. Instance features</title>
          <p>Instance features include the number of shots, the id of the prompt-template (the prompt-template differs between [10] and [34], but the same metafeatures can be extracted), and 54 simple binary metafeatures that can be automatically extracted through simple regular expressions from the original text. Examples include the kind of symbols the instance contains (e.g., numbers, dots, dashes) or whether it starts with a digit (see Figure 3 for an example). We refer to [29] for an overview. The binary metafeatures are available for all input and output text that is in the prompt, so for example for a 2-shot prompt, we would have 2 inputs and 2 outputs from the examples, and 1 input for the actual question, totalling 5 ⋅ 54 = 270 features.</p>
        </sec>
      </sec>
      <sec id="sec-2-2">
        <title>3.4. Assessor Building and Evaluation</title>
        <p>For the assessor model, we train a Random Forest [36] of 100 decision trees, with a minimum node size of 5 and 50% of the available variables randomly selected in each split, tuned through grid search on a validation set using an 84%-16% training-validation split of the training set defined previously (the non-standard train-test partition is the result of the instance matching procedure described in section 3.2). For the remaining hyperparameters the defaults were used (the randomForest package, https://cran.r-project.org/web/packages/randomForest/index.html, was used for training the assessor model). We report the Area Under the Receiver Operating Characteristic Curve (AUROC) and the Brier Score (BS), as well as its decomposition into calibration and refinement loss [37, 38, 39].</p>
        <p>As a baseline to compare assessors to, we take the standard approach of interpreting the probability p̂(y|x) the LM assigns to its output ŷ as the “confidence” of the model, i.e., its self-assessed probability of being correct. However, there is no p̂(y|x) data recorded in the BIG-bench logs, so we cannot compare the assessor AUROC or BS to those of the BIG-G family of models. For GPT-3 this information is available.</p>
      </sec>
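      <p>The Brier score and its decomposition into calibration and refinement loss can be computed as in the sketch below. This is a minimal variant that groups records by unique predicted probability (one simple form of the decompositions in the cited literature), with toy numbers rather than the paper's data.</p>

```python
# Hedged sketch of the reported metrics: Brier score of an assessor and its
# exact decomposition into calibration and refinement loss, grouping records
# by unique predicted probability.
from collections import defaultdict

def brier_decomposition(probs, outcomes):
    """probs: predicted success probabilities; outcomes: 0/1 actual scores."""
    n = len(probs)
    # Brier score: mean squared difference between probability and outcome
    brier = sum((p - o) ** 2 for p, o in zip(probs, outcomes)) / n
    groups = defaultdict(list)
    for p, o in zip(probs, outcomes):
        groups[p].append(o)
    # Calibration: distance between each predicted probability and the
    # observed success rate of its group
    calibration = sum(
        len(os) * (p - sum(os) / len(os)) ** 2 for p, os in groups.items()
    ) / n
    # Refinement: within-group outcome variance (0 when groups are pure)
    refinement = sum(
        len(os) * (sum(os) / len(os)) * (1 - sum(os) / len(os))
        for os in groups.values()
    ) / n
    return brier, calibration, refinement

probs = [0.9, 0.9, 0.9, 0.9, 0.2, 0.2, 0.2, 0.2, 0.2]
outcomes = [1, 1, 1, 0, 0, 0, 0, 0, 1]
bs, cal, ref = brier_decomposition(probs, outcomes)
# With this grouping the decomposition is exact: BS = CAL + REF
assert round(bs - (cal + ref), 9) == 0
```

      <p>A perfectly calibrated but unrefined predictor has all its loss in the refinement term; the paper's observation that most of the assessor's gap to GPT-3 sits in the calibration term is what motivates post-hoc recalibration.</p>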
    </sec>
    <sec id="sec-3">
      <title>4. Results and Discussion</title>
      <p>Since it is assumed that all models give better results with n + 1 shots than with n shots, Figure 4 shows the accuracies of the LMs on the original data-wrangling tasks with the maximum number of shots used in the BIG-G data (3 shots), the same number of shots for GPT-3 (for comparability), and the maximum used for GPT-3 (10 shots). Despite the promising progress in the state-of-the-art capabilities of LMs, they still struggle to master data-wrangling tasks with very few shots. For 3-shot inference, BIG-G dense 128B achieves an accuracy of 0.776, outperforming BIG-G sparse 8B and GPT-3 175B. When it comes to 10 shots (only GPT-3 was available), GPT-3 175B achieves a promising accuracy of nearly 90%, outperforming the other GPT-3 variants.</p>
      <p>Table 1 describes the AUROC and BS, decomposed into calibration loss (CAL) and refinement loss (REF), for the GPT-3 data given by an assessor trained with all features available except for the prompt-template id, along with GPT-3’s self-assessment on the same set of instances. Table 2 describes the AUROC and BS, with the same decomposition, for the BIG-G data given by the assessor. Finally, Table 3 shows the impact of various system and instance features on the performance of the assessor.</p>
      <p>Table 2: AUROC and BS (Calibration, Refinement) for BIG-G data, using a single assessor trained with both GPT-3 and BIG-G data and all available features (except prompt-template id). In the bottom row, AUROC and BS are not averaged, but calculated from the aggregated set of instances. The average accuracies of the LMs (with std. dev. across #shots) on the original data-wrangling task are also presented, and serve as an indication of the class distribution the assessor has to deal with.</p>
      <p>Table 3: Ablation study of the impact of various features (system id, #parameters, template id, fam. &amp; sp., #shots; the 54 instance metafeatures are always included) on assessor performance, reporting AUROC and BS (CAL, REF) per feature combination. Row 4 indicates the assessor we reported in Table 1 and Table 2. Row 1: 0.909, 0.130 (0.015, 0.115); Row 2: 0.912, 0.130 (0.015, 0.115); Row 3: 0.910, 0.127 (0.014, 0.113); Row 4: 0.916, 0.128 (0.017, 0.111); Row 5: 0.916, 0.126 (0.015, 0.111); Row 6: 0.917, 0.126 (0.015, 0.111); Row 7: 0.916, 0.128 (0.016, 0.112); Row 8: 0.868, 0.168 (0.031, 0.137); Row 9: 0.869, 0.167 (0.030, 0.137); Row 10: 0.865, 0.170 (0.033, 0.137).</p>
      <p>Analysing the results in Table 1 and Table 2, we see relatively good results overall in the assessor’s performance, reporting AUROCs of around 0.9, and BSs around 0.12 (Q1). It should be noted that the metrics for the smallest LMs have to be interpreted cautiously due to the significant imbalance in the distribution of LM scores (i.e., for very low accuracies it is easy to predict that the LM will usually fail). We also see that, in general, performance is worse for the BIG-G models than for the GPT-3 ones. This could be due to GPT-3 being more predictable, or to the availability of more data for GPT-3 (possibly making the assessor pay more attention to the majority model family in its generalisation). This observation suggests that the distribution of results of each system affects the performance of the assessor accordingly. If we would like to focus on building an assessor for a specific LM, techniques like instance weights or oversampling could have an effect.</p>
      <p>Comparing the results with GPT-3’s self-assessment in Table 1, we can conclude that the assessor performs slightly worse than GPT-3, but is definitely comparable (Q2). A significant part of the difference in BS comes from the calibration (CAL) term, and not from the refinement (REF) term, which is very similar for the LMs and the assessor, especially for the smaller versions of GPT-3. This suggests that post-hoc calibration methods [40] like isotonic regression could still improve results significantly.</p>
      <p>Finally, we discuss a concrete example using the assessor to implement a reject rule (see Table 4). For the GPT-3 data in the test set (24604 instances), we take a reject threshold of 1%, i.e., the assessor rejects an instance when the LM is assessed to have less than a 1% chance of succeeding on it. The assessor rejects 5340 instances, which account for 21.7% of the instances and (approximately) of the total compute. From these 5340, we have that 5114 are correctly rejected, representing 46% of the failures, at the cost of only 226 correct answers being rejected (about 1.5%). Therefore, a lot of compute, money, and emissions would be saved, since the assessor is far smaller than the LMs in terms of parameters and inference time. Concretely, the proposed assessor has 100 decision trees of (approximately) 20000 nodes, whose inference time is in the order of 100 ⋅ log2(20000) ≈ 1450 comparisons, much smaller than what LMs require for one pass through their billions of parameters.</p>
      <p>Table 4: Confusion matrix with reject threshold &lt; 0.01 of assessor predictions for GPT-3, where 0 and 1 represent wrong and correct responses by the LM respectively. Predicted failure: 5114 actual failures, 226 actual corrects; predicted correct: 6004 actual failures, 13260 actual corrects.</p>
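      <p>The reject-rule accounting in the concrete example above boils down to a simple count over (assessor probability, LM score) pairs. The sketch below uses toy numbers and a hypothetical reject_stats helper, not the paper's actual data.</p>

```python
# Toy sketch of the anticipative reject rule: reject an instance (i.e., never
# run the LM on it) when the assessed success probability falls below a
# threshold, then count avoided failures and lost correct answers.

def reject_stats(assessor_probs, lm_scores, threshold):
    """assessor_probs: assessed success probabilities; lm_scores: 0/1 outcomes."""
    avoided_failures = 0  # LM failures we never pay compute for
    lost_corrects = 0     # correct answers we give up by rejecting
    n_rejected = 0
    for p, score in zip(assessor_probs, lm_scores):
        accepted = p >= threshold
        if accepted:
            continue
        n_rejected += 1
        if score == 1:
            lost_corrects += 1
        else:
            avoided_failures += 1
    return n_rejected, avoided_failures, lost_corrects

probs = [0.005, 0.002, 0.40, 0.95, 0.008, 0.90]
scores = [0, 0, 1, 1, 1, 1]
stats = reject_stats(probs, scores, 0.01)  # 3 rejected: 2 failures, 1 correct
```

      <p>The same counting over the real test set yields the confusion matrix of Table 4; since the assessor itself costs on the order of a thousand comparisons per instance, every rejected instance is compute the LM never spends.</p>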
      <p>In the feature importance study in Table 3, we can see that using either the system id or the number of parameters improves performance significantly, likely because both can indicate the scale of the system, which highly correlates with performance. The use of system id generalises slightly worse than #parameters. Other features, like #shots, prompt-template id, or the model family and sparsity indicators, have less effect on the performance (Q3). The assessor can easily derive the #shots from the input (more examples result in more features being present), so this makes sense. We did not measure any effect on aggregated performance from the different prompt-templates, and it is likely this feature is simply non-informative. Regarding model family and sparsity, we hypothesise that there is a large overlap between which instances the LMs solve correctly, so model architecture is not indicative of major performance differences (or the assessor fails to pick up on them).</p>
    </sec>
    <sec id="sec-4">
      <title>5. Conclusions and Future Work</title>
      <p>We have illustrated how a small assessor can manage performance expectations at a level that is comparable to the self-assessment of giant language models with billions of parameters. We have shown the assessor can be well calibrated and make refined predictions. We find that the assessor picks up on system features like id or #parameters that explain large variances in performance. We showcase how assessors can be used to reject instances before running much larger language models, resulting in a significant saving of compute.</p>
      <p>There are of course some limitations to this work. For example, the instance metafeatures are specific to the data-wrangling tasks used. Nonetheless, the positive results hint at future work. There are still many ways of directly improving the assessor we have used here. For instance, we could use post-hoc calibration with methods such as isotonic regression, or add instance weights to the results of systems we especially care about. There are also many questions to further investigate. Do assessors work for other tasks? Can we use a small LM instead of a random forest to allow free-form input? What is the agreement between different systems, and with the assessor?</p>
      <p>These future ideas could be useful from the perspective of saving computing costs as we outlined before, but the schema is of wider applicability. There is a lot of useful information generated during the evaluation process that is lost upon aggregation. Assessors are an attempt at capturing this information and providing expectation management that is external, fine-grained, anticipative, and can make use of population data. We could use them as instance-level model selectors, or we might be able to apply explainability techniques on the assessor to find out what makes an instance difficult. There is definitely more to explore around the topic of assessors, which perform granular assessments beyond generic aggregated results: saving compute by rejecting examples where the original model is going to fail is an important illustrative application.</p>
    </sec>
    <sec id="sec-5">
      <title>Acknowledgments</title>
      <p>We thank the anonymous reviewers for their comments. This work has been partially supported by the Norwegian Research Council grant 329745 Machine Teaching for Explainable AI, the EU (FEDER) and Spanish MINECO grant RTI2018-094403-B-C32 funded by MCIN/AEI/10.13039/501100011033 and by “ERDF A way of making Europe”, Generalitat Valenciana under grant PROMETEO/2019/098, the EU’s Horizon 2020 research and innovation programme under grant agreement No. 952215 (TAILOR), US DARPA HR00112120007 (RECoG-AI), INNEST/2021/317 (project cofunded by the European Union with the “Programa Operativo del Fondo Europeo de Desarrollo Regional (FEDER) de la Comunitat Valenciana 2014-2020”), and the UPV (Vicerrectorado de Investigación) grant PAI-10-21.</p>
    </sec>
    <sec id="sec-6">
      <title>References</title>
      <p>[1] J. Devlin, M.-W. Chang, K. Lee, K. Toutanova, BERT: Pre-training of deep bidirectional transformers for language understanding, arXiv preprint arXiv:1810.04805 (2018).</p>
      <p>[2] C. Raffel, N. Shazeer, A. Roberts, K. Lee, S. Narang, M. Matena, Y. Zhou, W. Li, P. J. Liu, Exploring the limits of transfer learning with a unified text-to-text transformer, arXiv preprint arXiv:1910.10683 (2019).</p>
      <p>[3] T. B. Brown, B. Mann, N. Ryder, M. Subbiah, J. Kaplan, P. Dhariwal, A. Neelakantan, P. Shyam, G. Sastry, A. Askell, et al., Language models are few-shot learners, in: Advances in Neural Information Processing Systems, volume 33, 2020, pp. 1877–1901.</p>
      <p>[4] E. Kharitonov, A. Lee, A. Polyak, Y. Adi, J. Copet, K. Lakhotia, T.-A. Nguyen, M. Rivière, A. Mohamed, E. Dupoux, W.-N. Hsu, Text-free prosody-aware generative spoken language modeling, arXiv:2109.03264 [cs, eess] (2021).</p>
      <p>[5] J. W. Rae, S. Borgeaud, T. Cai, K. Millican, J. Hoffmann, F. Song, J. Aslanides, S. Henderson, R. Ring, S. Young, et al., Scaling language models: Methods, analysis &amp; insights from training Gopher, arXiv preprint arXiv:2112.11446 (2021).</p>
      <p>[6] D. Hendrycks, C. Burns, S. Basart, A. Zou, M. Mazeika, D. Song, J. Steinhardt, Measuring massive multitask language understanding, arXiv preprint arXiv:2009.03300 (2020).</p>
      <p>[7] D. Hendrycks, S. Basart, S. Kadavath, M. Mazeika, A. Arora, E. Guo, C. Burns, S. Puranik, H. He, D. Song, J. Steinhardt, Measuring coding challenge competence with APPS, 2021. arXiv:2105.09938.</p>
      <p>[8] R. Bommasani, et al., On the opportunities and risks of foundation models, arXiv preprint arXiv:2108.07258 (2021).</p>
      <p>[9] L. Ouyang, J. Wu, X. Jiang, D. Almeida, C. L. Wainwright, P. Mishkin, C. Zhang, S. Agarwal, K. Slama, A. Ray, et al., Training language models to follow instructions with human feedback, arXiv preprint arXiv:2203.02155 (2022).</p>
      <p>[10] A. Srivastava, A. Rastogi, et al., Beyond the imitation game: Quantifying and extrapolating the capabilities of language models, 2022. URL: https://arxiv.org/abs/2206.04615. doi:10.48550/ARXIV.2206.04615.</p>
      <p>[11] R. Herbei, M. H. Wegkamp, Classification with reject option, The Canadian Journal of Statistics/La Revue Canadienne de Statistique (2006) 709–721.</p>
      <p>[12] F. Tortorella, An optimal reject rule for binary classifiers, in: Joint IAPR International Workshops on Statistical Techniques in Pattern Recognition (SPR) and Structural and Syntactic Pattern Recognition (SSPR), Springer, 2000, pp. 611–620.</p>
      <p>[13] K. Hendrickx, L. Perini, D. Van der Plas, W. Meert, J. Davis, Machine learning with a reject option: A survey, arXiv preprint arXiv:2107.11277 (2021).</p>
      <p>[14] Z. Jiang, J. Araki, H. Ding, G. Neubig, How can we know when language models know? On the calibration of language models for question answering, Transactions of the Association for Computational Linguistics 9 (2021) 962–977. doi:10.1162/tacl_a_00407.</p>
      <p>[15] J. Hernández-Orallo, W. Schellaert, F. Martínez-Plumed, Training on the test set: Mapping the system-problem space in AI, Proceedings of the AAAI Conference on Artificial Intelligence (2022).</p>
      <p>[16] S. Kandel, J. Heer, C. Plaisant, J. Kennedy, F. Van Ham, N. H. Riche, C. Weaver, B. Lee, D. Brod</p>
    </sec>
  </body>
  <back>
    <ref-list />
  </back>
</article>