=Paper=
{{Paper
|id=Vol-3169/paper4
|storemode=property
|title=Reject Before You Run: Small Assessors Anticipate Big Language Models
|pdfUrl=https://ceur-ws.org/Vol-3169/paper4.pdf
|volume=Vol-3169
|authors=Lexin Zhou,Fernando Martínez-Plumed,José Hernández-Orallo,Cèsar Ferri,Wout Schellaert
|dblpUrl=https://dblp.org/rec/conf/ijcai/ZhouMHFS22
}}
==Reject Before You Run: Small Assessors Anticipate Big Language Models==
Lexin Zhou¹, Fernando Martínez-Plumed¹, José Hernández-Orallo¹,², Cèsar Ferri¹ and Wout Schellaert¹
¹ Valencian Research Institute for Artificial Intelligence (VRAIN), Universitat Politècnica de València
² Leverhulme Centre for the Future of Intelligence, University of Cambridge
Abstract
Large Language Models (LMs) are expensive to operate. It would be more frugal to avoid querying them when results are
predictably bad. In this paper we therefore investigate whether it is possible to granularly predict the performance of these
large LMs with a much smaller external model, the assessor, which is trained on evaluation results. For instance, given an
input prompt, can an assessor estimate the probability of correct completion by a giant like GPT-3 Davinci (175B parameters)?
Using a data-wrangling task included in the BIG-bench repository as a case study, we find it is indeed possible, and we report
results that are comparable in accuracy and calibration to the LM itself. This suggests that, at least for some tasks, a lot
of compute, money, and emissions could be spared through the assessor’s anticipative reject option. It also suggests that
assessors can capture meaningful extra information from the evaluation procedure, and as such, could be a useful complement
to simple aggregate metrics.
Keywords
Assessor, Anticipative Reject Option, Language Model, Data Wrangling, AI Evaluation, Instance Granularity
EBeM’22: Workshop on AI Evaluation Beyond Metrics, July 25, 2022, Vienna, Austria
Email: lzhou@inf.upv.es (L. Zhou); fermarpl@dsic.upv.es (F. Martínez-Plumed); jorallo@dsic.upv.es (J. Hernández-Orallo); cferri@dsic.upv.es (C. Ferri); wschell@vrain.upv.es (W. Schellaert)
Web: https://lexzhou.github.io/ (L. Zhou); https://nandomp.github.io/ (F. Martínez-Plumed); http://josephorallo.webs.upv.es/ (J. Hernández-Orallo); http://personales.upv.es/ceferra/ (C. Ferri); https://schellaert.org/ (W. Schellaert)
ORCID: 0000-0003-1161-4270 (L. Zhou); 0000-0003-2902-6477 (F. Martínez-Plumed); 0000-0001-9746-7632 (J. Hernández-Orallo); 0000-0002-8975-1120 (C. Ferri); 0000-0002-9182-4747 (W. Schellaert)
© 2022 Copyright for this paper by its authors. Use permitted under Creative Commons License Attribution 4.0 International (CC BY 4.0).
CEUR Workshop Proceedings (CEUR-WS.org), http://ceur-ws.org, ISSN 1613-0073
1. Introduction
Figure 1: (Top) Process of a LM generating the solution for a date transformation (repetitive) problem in a spreadsheet. Once the user prompts one instance of the desired transformation (row 1), the LM proceeds to transforming the rest of instances (rows 2 & onward). (Bottom) Process of an assessor that can reliably predict beforehand the performance of the LM at the instance-level.
Extensive experimental research on Language Models (LM) keeps showing remarkable results across several domains including mathematics, question answering, language understanding, and code generation [1, 2, 3, 4, 5, 6, 7, 8, 9]. While the performance results for many tasks are quickly improving –on average–, there is a high variance in the results depending on the particular task, the instances, and the prompts [10]. For a given task, one can partially deal with the variability across instances through a traditional reject rule, where we abstain from using the model’s decision when the probability of its answer (i.e., its “confidence”) falls below a certain threshold [11, 12, 13]. This requires a good calibration of the model. However, even if LMs were well calibrated, and they are generally not [10, 14], it would also still require actually running the inference. For large LMs this comes at a non-negligible cost per token, either in required infrastructure or through the price of the API. To avoid being wasteful, we explore how much we can anticipate the level of success for a particular instance (or collection of instances), without running it through the LM at all. For this, we need an assessor: an external conditional probability (or density) estimator that can reliably predict beforehand the performance of an LM at instance granularity [15]. With a good assessor, we could make the calculation of whether it is actually worth asking the
LM for an answer, depending on factors such as the value of a correct result, the cost of running the model, and of course the performance estimated by the assessor. See Figure 1 for an illustrative example.
This paper describes a (successful) attempt at building such an assessor for a collection of large LMs consisting of various scales of GPT-3 [3] and BIG-G [10] models for application to a diverse set of data-wrangling tasks. Data-wrangling [16, 17] is a notoriously time-consuming data preparation chore where LMs have recently shown promising results [18]. We discuss data-wrangling and the considerations regarding the use of LMs in sections 2.1 and 2.2.
Contributions
To our knowledge, this is the first paper analysing assessors applied to the language domain, and to a plausible use case in general. Additionally,
• We find that lightweight assessors can give reliable instance-level predictions of the performance of large LMs.
• We find that their predictions are well-calibrated and unbiased, again comparable to the self-assessment of the LMs.
• We investigate the contributions of various features like #shots and #parameters to assessor performance.
2. Background
In this section we revisit some key ideas of LMs, their costs, their applications to the data wrangling problem, and the traditional (post-hoc) reject option. We also summarise the main elements of the recently introduced concept of assessor models.
2.1. (Large) Language Models
In less than a decade, research in Natural Language Processing (NLP) has been overturned by the appearance of a suite of LMs trained in an unsupervised manner on very large corpora. LMs are capturing more and more of the information in natural language, including the linguistic characteristics of various human languages and the associated knowledge. Moreover, these models can be adapted (e.g., through fine-tuning) to a wide range of downstream tasks [8]. Recent LMs such as GPT-3 [3], PanGu-𝛼 [19], GLaM [20] and OPT [21] have excelled at few-shot inference, where a task is solved by supplying a small set of correct examples formatted as a prompt. The quality of the completion usually depends on the number of supplied examples. For instance, 5-shot inference is usually better than 2-shot inference, but requires more effort from the user.
However, on many occasions the cost of running LMs is not negligible, in either computational [22, 23] or economic terms [24]. Large LMs, open source or not, all have steep development costs in common. A recent study [24] puts the cost of developing an LM with only 1.5 billion parameters at $1.6 million. Inference costs are another drain: [25] estimates the cost of running GPT-3, if run in the cloud, at a minimum of $87,000 per year, with the current API price for Davinci being 6 cents per 750 words¹. Of course, these costs go down quickly as compute becomes cheaper, but larger models are expected to replace the old ones quickly to set the new state of the art. Also, as LMs increase their performance, their penetration rates will increase, becoming widespread in billions of semi-automated operations in many domains, and compute might easily become more of an issue, not less.
¹ https://openai.com/api/pricing/
2.2. Data Wrangling
Data-wrangling [16, 17] is a data preparation task that data janitors, data scientists and other people operating with forms, spreadsheets and other data formatting situations consider a very monotonous and laborious part of their jobs. Data wrangling can require as much as 80 percent of their time [26], including tediously transforming data presented in heterogeneous formats into a standardised format for efficient access, understanding, and analysis. One of the challenges in data-wrangling automation consists of selecting the correct (string) transformations from the vast set of possible ones, and doing so after having seen only a few examples [27]. Many approaches have attempted to address this challenge by reducing the transformation space through the incorporation of prior knowledge [28, 29]. This has led to many tools that use domain-specific languages or need ad-hoc solutions [30].
Because LMs capture vast amounts of human knowledge across many different domains, they can be especially effective for more open-ended tasks, and data wrangling is recognised as such in data science automation [31]. Using few-shot inference [3, 32, 33], LMs have shown promising yet unreliable results for data wrangling. For instance, in [18] GPT-3 Davinci (prompted, not finetuned) achieves a 56% accuracy in the 1-shot setting, 68% in the 4-shot setting, and almost 90% with 10 shots. Additionally, as opposed to LM results on other tasks, GPT-3 is also relatively well calibrated in the data-wrangling task, reporting a Brier score of 0.11 (see section 4).
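To make the few-shot setting concrete, the following Python sketch shows one way an n-shot data-wrangling prompt could be assembled from solved examples. The `Input:`/`Output:` layout, the function name `build_few_shot_prompt`, and the date strings are assumptions made for illustration; they are not the prompt templates or benchmark instances used in the evaluations cited here.

```python
from typing import List, Tuple


def build_few_shot_prompt(examples: List[Tuple[str, str]], query: str) -> str:
    """Format solved (input, output) pairs plus one open input as a prompt.

    The "Input:"/"Output:" layout is a hypothetical template used only for
    illustration; the cited evaluations use their own templates.
    """
    lines = []
    for source, target in examples:
        lines.append(f"Input: {source}")
        lines.append(f"Output: {target}")
    # The last input has no output: the LM is expected to complete it.
    lines.append(f"Input: {query}")
    lines.append("Output:")
    return "\n".join(lines)


# Invented date-formatting examples (not taken from the benchmark).
shots = [("2019-07-02", "02/07/19"), ("2020-06-01", "01/06/20")]
prompt = build_few_shot_prompt(shots, "2021-02-17")
print(prompt)
```

Passing an empty list of examples yields the 0-shot case, and each additional pair adds one shot, which is the quantity the accuracy figures above vary.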
2.3. Reject option
Given these unreliable accuracies but good calibration scores, we could have a more reliable and effective use of these systems by not using those for which the confidence of the system is low. In other words, if we know for which instances the LM is (likely to be) wrong, we can abstain from using the output of the LM in these cases. This is called a ‘reject option’, and a classic and straightforward implementation for it is to use a confidence threshold 𝑡 and compare it with the probability 𝑝(𝑦̂|𝑥) that a model 𝑝 assigns to its output 𝑦̂. This represents the self-assigned probability of being correct (i.e., its confidence) [11, 12, 13]. We set 𝑡 to match the error tolerance of the use case, and when 𝑝(𝑦̂|𝑥) < 𝑡, we do not use the output of the model, usually delegating to a human. However, this classical interpretation of the reject rule still requires running the model. As mentioned before, this can be expensive for large LMs. Whenever the reject rule triggers, it is not only that humans need to do the task manually, but we have also incurred a cost in the computation of a model that is effectively wasted.
2.4. Assessors
Assessor models [15] provide an external anticipative reject option instead. Assessors are conditional probability (or density) estimators 𝑅̂(𝑟|𝜋, 𝜇) that are trained on evaluation data. With ‘evaluation data’ we mean a set of evaluation records ⟨𝜋, 𝜇, 𝑟⟩, where 𝜋 refers to a profile or description of a particular system (e.g., deployment conditions, state, system architecture, or hyperparameters), 𝜇 refers to a particular instance (e.g., a prompt), and 𝑟 to an empirical measurement of the performance of 𝜋 on 𝜇. Assessor models are meant to act as general mappings between the space of systems, the space of instances, and the corresponding distribution of scores. They are a way of capturing all available evaluation information in a single predictive model that could be used, e.g., to investigate what features make an instance difficult, to add confidence capabilities to systems that do not have them, or to select the optimal model for a specific instance. In this case, we focus on their use to provide an anticipative reject option: when 𝑅̂ is built and shown to be an accurate estimator, we can use it to make inferences on the expected performance 𝑅̂(𝑟 = 1|𝜋, 𝜇) given a system 𝜋 and instance 𝜇 (or a collection of those).
We do still have to run actual inference on the assessor, but as we show in the experiments, they have the possibility of being multiple orders of magnitude smaller than the LMs, allowing us to cheaply avoid any LM inference that is doomed to fail.
3. Methods
In this section we identify the experimental setting, including goals of the analysis, the data sources and how they are converted into evaluation records, and how we build the assessor from them.
3.1. Experimental Questions
We set three experimental questions:
Q1: Can we build lightweight yet good assessors for language models in this domain?
Q2: Are the assessors of comparable quality to the language models when estimating probabilities?
Q3: What features from the systems and the instances are most relevant for predicting success and consequently for building good assessors?
3.2. Data Sources and Train-Test Split
We work with the Data Wrangling Dataset Repository², containing 119 tasks from 7 domains (dates, emails, free text, names, phones, times, and units). In particular, we use results (at instance level) from multiple LMs obtained from two different evaluation efforts. First, [34] have produced granular results of the evaluation of different versions of GPT-3. We have 146k instances available for GPT-3 models Ada (350M), Babbage (1.3B), Curie (6.7B), and Davinci (175B), from 0-shot to 10-shot. More information about the architectures can be found in [3]. Second, [10] provides results on the same benchmark for a collection of Google LMs of various parameter sizes. Here we extract 86k instances, from 0-shot to 3-shot, for 22 models with parameter sizes ranging from 2M to 128B across two different model families, a decoder-only dense transformer (BIG-G dense) and a sparse Mixture-of-Experts [35] model (BIG-G sparse). More information on the BIG-G network architectures is available in [10]. All models (GPT-3 and BIG-G variants) were queried with temperature set to 0, and none of them were fine-tuned for the data-wrangling task.
² http://dmip.webs.upv.es/datawrangling/
As the assessor is trained on a somewhat heterogeneous collection of systems and instances, we have to be careful to define a train-test partition of the evaluation results without contamination or information leakage. To this purpose, we must ensure that the same instances are consistently used across systems and shots. For example, we have to avoid that the result of BIG-G dense with 2-shots on instance 𝑖 is in the training set, while GPT-3 Ada’s result with 0-shots on the same instance is in the test set. Figure 2 shows a visual representation of the partition requirements that ensure that this does not happen.
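The matching requirement amounts to splitting by instance rather than by individual evaluation record. The Python sketch below illustrates the idea under stated assumptions: the record fields (`instance_id`, `system`, `shots`, `score`), the function name `split_by_instance`, the 20% test fraction, and the toy records are all hypothetical, and this is not the partitioning code used for the experiments.

```python
import random
from typing import Dict, List, Tuple

Record = Dict[str, object]


def split_by_instance(
    records: List[Record], test_fraction: float = 0.2, seed: int = 0
) -> Tuple[List[Record], List[Record]]:
    """Split evaluation records so that each instance lands on one side only."""
    # Collect the distinct instances; sorting makes the shuffle reproducible.
    instance_ids = sorted({r["instance_id"] for r in records})
    rng = random.Random(seed)
    rng.shuffle(instance_ids)
    n_test = max(1, round(test_fraction * len(instance_ids)))
    test_ids = set(instance_ids[:n_test])
    # Every record of a held-out instance goes to test, whatever the system or #shots.
    train = [r for r in records if r["instance_id"] not in test_ids]
    test = [r for r in records if r["instance_id"] in test_ids]
    return train, test


# Toy evaluation records: 10 instances x 2 systems x 2 shot settings (all invented).
records = [
    {"instance_id": i, "system": s, "shots": k, "score": (i + k) % 2}
    for i in range(10)
    for s in ("GPT-3 Ada", "BIG-G dense 2M")
    for k in (0, 2)
]
train, test = split_by_instance(records, test_fraction=0.2, seed=0)
```

Because the assignment is decided once per instance, a result for a given instance with one system and shot count can never sit in the training set while another result for the same instance sits in the test set.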
The order-matched train-test partition leads to 194k training instances and 38k testing instances.
Figure 2: Illustration of the matching requirements for making a train-test partition for the assessor. Each column represents a data wrangling prompt used to evaluate a LM. Orange columns represent instances included in the training set for the assessor, while green represents those included in the test set for the assessor. To avoid contamination, the same instances should be used across different shots and systems.
3.3. Anatomy of the Evaluation Record
From our two data sources, we receive records of the shape ⟨system id, #shots, prompt, score⟩. We further annotate this record with features describing the system (𝜋), and extract metafeatures of the instance (𝜇) that are fit for tabular representation (as opposed to free form text). In the end, this creates a general record of the shape ⟨𝜋, 𝜇, 𝑟⟩ = ⟨⟨system features⟩, ⟨instance features⟩, score⟩. We describe these features in detail below, but ultimately the only constraint for making a useful assessor is that all system and instance features are available without actually running the original model.
3.3.1. System features
The available system features include a system id that refers to a specific trained LM, i.e., a set of learned parameters fitting a certain architecture, the id of that architecture (either GPT-3, BIG-G sparse, or BIG-G dense), whether a model is dense or sparse, and the number of parameters. These features will of course be the same for all records of the same trained model.
3.3.2. Instance features
Instance features include the number of shots, the id of the prompt-template³, and 54 simple binary metafeatures that can be automatically extracted through simple regular expressions from the original text. Examples include the kind of symbols the instance contains (e.g., numbers, dots, dashes) or whether it starts with a digit (see Figure 3 for an example). We refer to [29] for an overview. The binary metafeatures are available for all input and output that is in the prompt, so for example for a 2-shot prompt, we would have 2 inputs and 2 outputs from the examples, and 1 input for the actual question, totalling 5 ⋅ 54 = 270 features.
³ The prompt-template differs between [10] and [34], but the same metafeatures can be extracted.
Figure 3: Example of metafeatures that can be extracted from the examples of different domains (dates and emails in the figure). Adapted from [29]. Metafeature labels shown in the figure: hasPunctuation, isNumeric, hasDot, hasDigits, hasAt, startWithDigit, startLower.
3.3.3. Score
For the data wrangling tasks, all scores are binary: 1 if the output of the LM matches the target string exactly, and 0 otherwise. The score is what the assessors must predict, and thus acts as a label during training.
3.4. Assessor Building and Evaluation
For the assessor model, we train a Random Forest [36] of 100 decision trees, a minimum node size of 5, and select randomly 50% of the available variables in each split, tuned through grid search on a validation set using 84%-16% training-validation split⁴ from the training set defined previously. For the remaining hyperparameters the defaults were used⁵. We report the Area Under Receiver Operating Characteristic Curve (AUROC) and Brier Score (BS), as well as its decomposition into calibration and refinement loss [37, 38, 39].
⁴ The non-standard train-test partition is the result of the instance matching procedure described in section 3.2.
⁵ The RandomForest package (https://cran.r-project.org/web/packages/randomForest/index.html) was used for training the assessor model.
As a baseline to compare assessors to, we take the standard approach of interpreting the probability 𝑝(𝑦̂|𝑥) the LM assigns to its output 𝑦̂ as the “confidence” of the model, i.e. its self-assessed probability of being correct. However, there is no data 𝑝(𝑦̂|𝑥) recorded in the BIG-bench logs, so we cannot compare the assessor AUROC or BS to those of the BIG-G family of models. For GPT-3 this information is available.
4. Results and Discussion
Since it is assumed that all models give better results with 𝑛 + 1 shots than with 𝑛 shots, Figure 4 shows the accuracies of the LMs, with the maximum number of shots
used in BIG-G data (3-shots), the same number of shots for GPT-3 (for comparability), and the maximum used (10-shots) on the original data-wrangling tasks. Despite the promising progress in the state-of-the-art capabilities of LMs, they still struggle to master data-wrangling tasks with very few shots. For 3-shot inference, BIG-G dense 128B achieves an accuracy of 0.776, outperforming BIG-G sparse 8B and GPT-3 175B. When it comes to 10-shots (only GPT-3 was available), GPT-3 175B achieves a promising accuracy of nearly 90%, outperforming other GPT-3 variants.
Figure 4: LMs’ accuracies per LM and size on the original data-wrangling task. Logarithmic scale used on the 𝑥-axis. Axes: Parameters (M) on the 𝑥-axis, Accuracy on the 𝑦-axis. Series: BIG-G dense (3-shot), BIG-G sparse (3-shot), GPT-3 (3-shot), GPT-3 (10-shot).
Table 1 describes the AUROC and BS, decomposed into calibration loss (CAL) and refinement loss (REF), for the GPT-3 data given by an assessor trained with all features available except for the prompt-template id, along with the GPT-3’s self-assessment on the same set of instances. Table 2 describes the AUROC and BS, decomposed into calibration loss (CAL) and refinement loss (REF), for the BIG-G data given by the assessor. Finally, Table 3 shows the impact of various system and instance features on the performance of the assessor.
Table 1: AUROC and BS (Calibration, Refinement) in the GPT-3 data, for a single assessor trained with both GPT-3 and BIG-G data using all available features (except prompt-template id), alongside the self-estimation from GPT-3 LMs. In the bottom row, AUROC and BS are not averaged, but calculated from the aggregated set of instances. The average accuracies from GPT-3 LMs (with std. dev. across #shots) on the original data-wrangling task are also presented, and serve as an indication of the class distribution the assessor has to deal with.
LM id | LM Acc. | 𝑅̂ AUROC | 𝑅̂ BS (CAL, REF) | GPT-3 self-estimation AUROC | GPT-3 self-estimation BS (CAL, REF)
GPT-3 Ada 350M | 0.524±0.232 | 0.901 | 0.144 (0.033, 0.111) | 0.908 | 0.122 (0.005, 0.117)
GPT-3 Babbage 1.3B | 0.580±0.240 | 0.914 | 0.141 (0.036, 0.106) | 0.920 | 0.116 (0.004, 0.102)
GPT-3 Curie 6.7B | 0.625±0.244 | 0.918 | 0.130 (0.025, 0.105) | 0.934 | 0.108 (0.011, 0.097)
GPT-3 Davinci 175B | 0.689±0.253 | 0.917 | 0.125 (0.022, 0.099) | 0.944 | 0.096 (0.008, 0.087)
Aggregated | 0.604±0.262 | 0.916 | 0.135 (0.024, 0.111) | 0.929 | 0.110 (0.005, 0.105)
Table 2: AUROC and BS (Calibration, Refinement) for BIG-G data using a single assessor trained with both GPT-3 and BIG-G data using all available features (except prompt-template id). In the bottom rows, AUROC and BS are not averaged, but calculated from the aggregated set of instances. The average accuracies of the LM (with std. dev. across #shots) on the original data-wrangling task are also presented, and serve as an indication of the class distribution the assessor has to deal with.
LM id | LM Acc. | 𝑅̂ AUROC | 𝑅̂ BS (CAL, REF)
BIG-G sparse 2M | 0.018±0.008 | 0.919 | 0.005 (0.002, 0.003)
BIG-G sparse 16M | 0.054±0.015 | 0.636 | 0.022 (0.007, 0.015)
BIG-G sparse 53M | 0.103±0.044 | 0.747 | 0.064 (0.020, 0.044)
BIG-G sparse 125M | 0.250±0.144 | 0.833 | 0.121 (0.044, 0.077)
BIG-G sparse 244M | 0.330±0.199 | 0.844 | 0.135 (0.051, 0.084)
BIG-G sparse 422M | 0.376±0.232 | 0.840 | 0.151 (0.064, 0.084)
BIG-G sparse 1B | 0.445±0.267 | 0.867 | 0.147 (0.061, 0.087)
BIG-G sparse 2B | 0.479±0.296 | 0.885 | 0.139 (0.060, 0.079)
BIG-G sparse 4B | 0.491±0.302 | 0.877 | 0.148 (0.060, 0.088)
BIG-G sparse 8B | 0.533±0.325 | 0.866 | 0.155 (0.057, 0.098)
BIG-G dense 2M | 0.013±0.001 | 0.919 | 0.011 (0.003, 0.008)
BIG-G dense 16M | 0.051±0.012 | 0.781 | 0.020 (0.009, 0.010)
BIG-G dense 53M | 0.118±0.056 | 0.741 | 0.073 (0.018, 0.055)
BIG-G dense 125M | 0.207±0.117 | 0.809 | 0.117 (0.047, 0.070)
BIG-G dense 244M | 0.291±0.179 | 0.836 | 0.133 (0.047, 0.086)
BIG-G dense 422M | 0.331±0.203 | 0.784 | 0.165 (0.065, 0.100)
BIG-G dense 1B | 0.407±0.258 | 0.853 | 0.154 (0.073, 0.081)
BIG-G dense 2B | 0.447±0.276 | 0.834 | 0.173 (0.069, 0.104)
BIG-G dense 4B | 0.479±0.292 | 0.875 | 0.147 (0.057, 0.090)
BIG-G dense 8B | 0.493±0.304 | 0.869 | 0.151 (0.049, 0.102)
BIG-G dense 27B | 0.516±0.321 | 0.883 | 0.144 (0.058, 0.086)
BIG-G dense 128B | 0.574±0.353 | 0.857 | 0.164 (0.059, 0.104)
Aggregated (BIG-G sparse) | 0.308±0.275 | 0.894 | 0.109 (0.012, 0.097)
Aggregated (BIG-G dense) | 0.328±0.273 | 0.884 | 0.121 (0.015, 0.106)
Analysing the results in Table 1 and Table 2, we see relatively good results overall in the assessor’s performance, reporting AUROCs of around 0.9, and BSs around 0.12 (Q1). It should be noted that the metrics for the smallest LMs have to be interpreted cautiously due to the significant imbalance in LM scores distribution (i.e., for
very low accuracies it is easy to predict that the LM will usually fail). We also see that, in general, performance is worse for the BIG-G models than for the GPT-3 ones. This could be due to GPT-3 being more predictable, or to the availability of more data for GPT-3 (possibly making the assessor pay more attention to the majority model family in its generalisation). This observation suggests that the distribution of results of each system affects the performance of the assessor accordingly. If we would like to focus on building an assessor for a specific LM, techniques like instance weights or oversampling could have an effect.
Comparing these results with GPT-3’s self-assessment in Table 1, we can conclude that the assessor performs slightly worse than GPT-3, but is definitely comparable (Q2). A significant part of the difference in BS comes from the calibration (CAL) term, and not from the refinement (REF) term, which is very similar for the LMs and the assessor, especially for the smaller versions of GPT-3. This suggests that post-hoc calibration methods [40] like isotonic regression could still improve results significantly.
Table 3: Ablation study of the impact of various features on assessor performance. The 54 instance metafeatures are always included. Row 4 indicates the assessor we reported in Table 1 and Table 2. Feature columns in the original table: system id, # parameters, fam. & sparse, # shots, template id; each • marks one included feature (the column each mark belongs to is not recoverable here).
Row | Included features | AUROC (𝑅̂) | BS (CAL, REF) (𝑅̂)
1 | • | 0.909 | 0.130 (0.015, 0.115)
2 | • • | 0.910 | 0.130 (0.015, 0.115)
3 | • • • | 0.912 | 0.127 (0.014, 0.113)
4 | • • • • | 0.916 | 0.128 (0.017, 0.111)
5 | • • • | 0.917 | 0.126 (0.015, 0.111)
6 | • • | 0.916 | 0.126 (0.015, 0.111)
7 | • • • | 0.916 | 0.128 (0.016, 0.112)
8 | • • | 0.869 | 0.167 (0.030, 0.137)
9 | • • | 0.868 | 0.168 (0.031, 0.137)
10 | • | 0.865 | 0.170 (0.033, 0.137)
In the feature importance study in Table 3, we can see that using either the system id or the number of parameters improves performance significantly, likely because both can indicate the scale of the system, which highly correlates with performance. The use of system id generalises slightly worse than #parameters. Other features, like #shots, prompt-template id, or model family and sparsity indicators have less effect on the performance (Q3). The assessor can easily derive the #shots from the input (more examples result in more features being present), so this makes sense. We did not measure any effect on aggregated performance from the different prompt-templates, and it is likely this feature is simply non-informative. Regarding model family and sparsity, we hypothesise that there is a large overlap between which instances the LMs solve correctly, so model architecture is not indicative of major performance differences (or the assessor fails to pick up on them).
Finally, we discuss a concrete example using the assessor to implement a reject rule (see Table 4). For the GPT-3 data in the test set (24604 instances), we take a reject threshold of 1%, i.e., we reject instances where the assessor deems it is less than 1% likely the LM would succeed. The assessor rejects about 5340 instances, which account for 21.7% of the instances and (approximately) of the total compute. Of these 5340, 5114 are correctly rejected, representing 46% of the failures, at the cost of only 226 correct answers being rejected (about 1.5%).
Therefore, a lot of compute, money, and emissions would be saved since the assessor is far smaller than the LMs in terms of parameters and inference time. Concretely, the proposed assessor has 100 decision trees of (approximately) 20000 nodes, whose inference time is in the order of 100 ⋅ log2(20000) ≈ 1450 comparisons, far fewer operations than an LM requires for one pass through its billions of parameters.
Table 4: Confusion matrix with reject threshold < 0.01 of assessor predictions for GPT-3. “failure” and “correct” denote wrong and correct responses by the LM respectively.
Predicted \ Actual | failure | correct
failure | 5114 | 226
correct | 6004 | 13481
5. Conclusions and Future work
We have illustrated how a small assessor can manage performance expectations at a level that is comparable to the self-assessment of giant language models with billions of parameters. We have shown the assessor can be well calibrated and make refined predictions. We find that the assessor picks up on system features like id or # parameters that explain large variances in performance. We showcase how assessors can be used to reject instances before running much larger language models, resulting in a significant saving of compute.
There are of course some limitations to this work. For example, the instance metafeatures are specific to the data-wrangling tasks used. Nonetheless, the positive results hint at future work. There are still many ways of directly improving the assessor we have used here. For instance, we could use post-hoc calibration with methods such as isotonic regression, or add instance weights to the results of systems we especially care about. There are also many questions to further investigate. Do assessors work for other tasks? Can we use a small LM instead
of a random forest to allow free form input? What is [4] E. Kharitonov, A. Lee, A. Polyak, Y. Adi,
the agreement between different systems, and with the J. Copet, K. Lakhotia, T.-A. Nguyen, M. Rivière,
assessor? A. Mohamed, E. Dupoux, W.-N. Hsu, Text-
These future ideas could be useful from the perspective Free Prosody-Aware Generative Spoken Language
of saving computing costs as we outlined before, but the Modeling, arXiv:2109.03264 [cs, eess] (2021).
schema is of wider applicability. There is a lot of useful arXiv:2109.03264 .
information generated during the evaluation process that [5] J. W. Rae, S. Borgeaud, T. Cai, K. Millican, J. Hoff-
is lost upon aggregation. Assessors are an attempt at mann, F. Song, J. Aslanides, S. Henderson, R. Ring,
capturing this information and providing expectation S. Young, et al., Scaling language models: Methods,
management that is external, fine grained, anticipative, analysis & insights from training Gopher, arXiv
and can make use of population data. We could use them preprint arXiv:2112.11446 (2021).
as instance-level model selectors, or we might be able [6] D. Hendrycks, C. Burns, S. Basart, A. Zou,
apply explainability techniques on the assessor to find M. Mazeika, D. Song, J. Steinhardt, Measuring
out what makes an instance difficult. massive multitask language understanding, arXiv
There is definitely more to explore around the topic of preprint arXiv:2009.03300 (2020).
assessors, which perform granular assessments beyond [7] D. Hendrycks, S. Basart, S. Kadavath, M. Mazeika,
generic aggregated results: saving compute by rejecting A. Arora, E. Guo, C. Burns, S. Puranik, H. He,
examples where the original model is going to fail is an D. Song, J. Steinhardt, Measuring coding challenge
important illustrative application. competence with APPS, 2021. arXiv:2105.09938 .
Acknowledgments
We thank the anonymous reviewers for their comments. This work has been partially supported by the Norwegian
Research Council grant 329745 Machine Teaching for Explainable AI, also by the EU (FEDER) and Spanish MINECO
grant RTI2018-094403-B-C32 funded by MCIN/AEI/10.13039/501100011033 and by “ERDF A way of making Europe”,
Generalitat Valenciana under grant PROMETEO/2019/098, EU’s Horizon 2020 research and innovation programme
under grant agreement No. 952215 (TAILOR), US DARPA HR00112120007 (RECoG-AI), and INNEST/2021/317 (Project
cofunded by the European Union with the “Programa Operativo del Fondo Europeo de Desarrollo Regional (FEDER)
de la Comunitat Valenciana 2014-2020”) and the UPV (Vicerrectorado de Investigación) grant PAI-10-21.
References (SSPR), Springer, 2000, pp. 611–620.
[13] K. Hendrickx, L. Perini, D. Van der Plas, W. Meert,
[1] J. Devlin, M.-W. Chang, K. Lee, K. Toutanova, J. Davis, Machine learning with a reject option: A
Bert: Pre-training of deep bidirectional transform- survey, arXiv preprint arXiv:2107.11277 (2021).
ers for language understanding, arXiv preprint [14] Z. Jiang, J. Araki, H. Ding, G. Neubig, How
arXiv:1810.04805 (2018). Can We Know When Language Models Know?
[2] C. Raffel, N. Shazeer, A. Roberts, K. Lee, S. Narang, On the Calibration of Language Models for Ques-
M. Matena, Y. Zhou, W. Li, P. J. Liu, Exploring the tion Answering, Transactions of the Association
limits of transfer learning with a unified text-to- for Computational Linguistics 9 (2021) 962–977.
text transformer, arXiv preprint arXiv:1910.10683 doi:10.1162/tacl_a_00407 .
(2019). [15] J. Hernández-Orallo, W. Schellaert, F. Martınez-
[3] T. B. Brown, B. Mann, N. Ryder, M. Subbiah, J. Ka- Plumed, Training on the test set: Mapping the
plan, P. Dhariwal, A. Neelakantan, P. Shyam, G. Sas- system-problem space in AI, Proceedings of the
try, A. Askell, et al., Language models are few-shot AAAI Conference on Artificial Intelligence (2022).
learners, in: Advances in Neural Information Pro- [16] S. Kandel, J. Heer, C. Plaisant, J. Kennedy,
cessing Systems, volume 33, 2020, pp. 1877–1901. F. Van Ham, N. H. Riche, C. Weaver, B. Lee, D. Brod-
beck, P. Buono, Research directions in data wran- on Databases, Springer, 2017, pp. 36–48.
gling: Visualizations and transformations for us- [28] S. Gulwani, J. Hernández-Orallo, E. Kitzelmann,
able and credible data, Information Visualization S. H. Muggleton, U. Schmid, B. Zorn, Inductive
10 (2011) 271–288. programming meets the real world, Communica-
[17] T. Furche, G. Gottlob, L. Libkin, G. Orsi, N. Paton, tions of the ACM 58 (2015) 90–99.
Data wrangling for big data: Challenges and op- [29] L. Contreras-Ochando, C. Ferri, J. Hernández-
portunities, in: Advances in Database Technol- Orallo, F. Martínez-Plumed, M. J. Ramírez-
ogy—EDBT 2016: Proceedings of the 19th Interna- Quintana, S. Katayama, Automated data transfor-
tional Conference on Extending Database Technol- mation with inductive programming and dynamic
ogy, 2016, pp. 473–478. background knowledge, in: Joint European Confer-
[18] G. Jaimovitch López, Comparison between machine ence on Machine Learning and Knowledge Discov-
learning and human learning from examples gener- ery in Databases, Springer, 2019, pp. 735–751.
ated with machine teaching, 2020. [30] S. Kandel, A. Paepcke, J. Hellerstein, J. Heer, Wran-
[19] W. Zeng, X. Ren, T. Su, H. Wang, Y. Liao, Z. Wang, gler: Interactive visual specification of data trans-
X. Jiang, Z. Yang, K. Wang, X. Zhang, et al., formation scripts, in: Proceedings of the sigchi con-
Pangu-𝛼: Large-scale autoregressive pretrained chi- ference on human factors in computing systems,
nese language models with auto-parallel computa- 2011, pp. 3363–3372.
tion, arXiv preprint arXiv:2104.12369 (2021). [31] T. De Bie, L. De Raedt, J. Hernández-Orallo, H. H.
[20] N. Du, Y. Huang, A. M. Dai, S. Tong, D. Lepikhin, Hoos, P. Smyth, C. K. I. Williams, Automating data
Y. Xu, M. Krikun, Y. Zhou, A. W. Yu, O. Firat, science, Communications of the ACM 65 (2022)
B. Zoph, L. Fedus, M. Bosma, Z. Zhou, T. Wang, Y. E. 76–87. doi:10.1145/3495256 .
Wang, K. Webster, M. Pellat, K. Robinson, K. Meier- [32] R. Puri, B. Catanzaro, Zero-shot text classification
Hellstern, T. Duke, L. Dixon, K. Zhang, Q. V. Le, with generative language models, arXiv preprint
Y. Wu, Z. Chen, C. Cui, GLaM: Efficient Scal- arXiv:1912.10165 (2019).
ing of Language Models with Mixture-of-Experts, [33] T. Schick, H. Schütze, Exploiting cloze questions
arXiv:2112.06905 [cs] (2021). arXiv:2112.06905 . for few shot text classification and natural language
[21] S. Zhang, S. Roller, N. Goyal, M. Artetxe, M. Chen, inference, arXiv preprint arXiv:2001.07676 (2020).
S. Chen, C. Dewan, M. Diab, X. Li, X. V. Lin, T. Mi- [34] G. Jaimovitch-Lopez, C. Ferri, J. Hernandez-Orallo,
haylov, M. Ott, S. Shleifer, K. Shuster, D. Simig, P. S. F. Martınez-Plumed, M. J. Ramırez-Quintana, Can
Koura, A. Sridhar, T. Wang, L. Zettlemoyer, OPT: language models automate data wrangling?, in:
Open Pre-trained Transformer Language Models, Workshop on Automating Datascience at ECML-
arXiv:2205.01068 [cs] (2022). arXiv:2205.01068 . PKDD, 2021, p. 13.
[22] J. Kaplan, S. McCandlish, T. Henighan, T. B. Brown, [35] B. Zoph, I. Bello, S. Kumar, N. Du, Y. Huang, J. Dean,
B. Chess, R. Child, S. Gray, A. Radford, J. Wu, N. Shazeer, W. Fedus, Designing effective sparse
D. Amodei, Scaling laws for neural language expert models, arXiv preprint arXiv:2202.08906
models, CoRR abs/2001.08361 (2020). URL: https: (2022).
//arxiv.org/abs/2001.08361. arXiv:2001.08361 . [36] L. Breiman, Random forests, Machine learning 45
[23] R. Desislavov, F. Martínez-Plumed, J. Hernández- (2001) 5–32.
Orallo, Compute and energy consumption trends [37] A. Murphy, A New Vector Partition of the Prob-
in deep learning inference, arXiv preprint ability Score., Journal of Applied Meteorology 12
arXiv:2109.05472 (2021). (1973) 595–600.
[24] O. Sharir, B. Peleg, Y. Shoham, The cost of training [38] P. Flach, E. Matsubara, On classification, ranking,
NLP models: A concise overview, arXiv preprint and probability estimation, in: Dagstuhl Seminar
arXiv:2004.08900 (2020). Proceedings, Schloss Dagstuhl-Leibniz-Zentrum fr
[25] B. Dickson, The GPT-3 economy, Informatik, 2008.
https://bdtechtalks.com/2020/09/21/ [39] J. Hernández-Orallo, P. Flach, C. Ferri Ramírez, A
gpt-3-economy-business-model/, 2020. unified view of performance metrics: Translating
[26] D. Steinberg, How much time needs threshold choice into expected classification loss,
to be spent preparing data for analysis?, Journal of Machine Learning Research 13 (2012)
http://info.salford-systems.com/blog/bid/299181/ 2813–2869.
How-Much-Time-Needs-to-be-Spent-Preparing-Data/[40] A. Bella, C. Ferri, J. Hernández-Orallo, M. J. Ramírez-
/-for-Analysis (2013). Quintana, Calibration of machine learning models,
[27] A. Bogatu, N. W. Paton, A. A. Fernandes, Towards in: Handbook of Research on Machine Learning Ap-
automatic data format transformations: Data wran- plications and Trends: Algorithms, Methods, and
gling at scale, in: British International Conference Techniques, IGI Global, 2010, pp. 128–146.