=Paper= {{Paper |id=Vol-3169/paper4 |storemode=property |title=Reject Before You Run: Small Assessors Anticipate Big Language Models |pdfUrl=https://ceur-ws.org/Vol-3169/paper4.pdf |volume=Vol-3169 |authors=Lexin Zhou,Fernando Martínez-Plumed,José Hernández-Orallo,Cèsar Ferri,Wout Schellaert |dblpUrl=https://dblp.org/rec/conf/ijcai/ZhouMHFS22 }} ==Reject Before You Run: Small Assessors Anticipate Big Language Models== https://ceur-ws.org/Vol-3169/paper4.pdf
Reject Before You Run: Small Assessors Anticipate Big Language Models
Lexin Zhou¹, Fernando Martínez-Plumed¹, José Hernández-Orallo¹˒², Cèsar Ferri¹ and Wout Schellaert¹
¹ Valencian Research Institute for Artificial Intelligence (VRAIN), Universitat Politècnica de València
² Leverhulme Centre for the Future of Intelligence, University of Cambridge


Abstract
Large Language Models (LMs) are expensive to operate. It would be more frugal to avoid querying them when results are predictably bad. In this paper we therefore investigate whether it is possible to granularly predict the performance of these large LMs with a much smaller external model, the assessor, which is trained on evaluation results. For instance, given an input prompt, can an assessor estimate the probability of correct completion by a giant like GPT-3 Davinci (175B parameters)? Using a data-wrangling task included in the BIG-bench repository as a case study, we find it is indeed possible, and we report results that are comparable in accuracy and calibration to the LM itself. This suggests that, at least for some tasks, a lot of compute, money, and emissions could be spared through the assessor’s anticipative reject option. It also suggests that assessors can capture meaningful extra information from the evaluation procedure, and as such, could be a useful complement to simple aggregate metrics.

Keywords
Assessor, Anticipative Reject Option, Language Model, Data Wrangling, AI Evaluation, Instance Granularity



EBeM’22: Workshop on AI Evaluation Beyond Metrics, July 25, 2022, Vienna, Austria
Email: lzhou@inf.upv.es (L. Zhou); fermarpl@dsic.upv.es (F. Martínez-Plumed); jorallo@dsic.upv.es (J. Hernández-Orallo); cferri@dsic.upv.es (C. Ferri); wschell@vrain.upv.es (W. Schellaert)
Web: https://lexzhou.github.io/ (L. Zhou); https://nandomp.github.io/ (F. Martínez-Plumed); http://josephorallo.webs.upv.es/ (J. Hernández-Orallo); http://personales.upv.es/ceferra/ (C. Ferri); https://schellaert.org/ (W. Schellaert)
ORCID: 0000-0003-1161-4270 (L. Zhou); 0000-0003-2902-6477 (F. Martínez-Plumed); 0000-0001-9746-7632 (J. Hernández-Orallo); 0000-0002-8975-1120 (C. Ferri); 0000-0002-9182-4747 (W. Schellaert)
© 2022 Copyright for this paper by its authors. Use permitted under Creative Commons License Attribution 4.0 International (CC BY 4.0).
CEUR Workshop Proceedings (CEUR-WS.org), http://ceur-ws.org, ISSN 1613-0073

1. Introduction

Extensive experimental research on Language Models (LMs) keeps showing remarkable results across several domains including mathematics, question answering, language understanding, and code generation [1, 2, 3, 4, 5, 6, 7, 8, 9]. While the results for many tasks are improving quickly on average, there is a high variance in the results depending on the particular task, the instances, and the prompts [10]. For a given task, one can partially deal with the variability across instances through a traditional reject rule, where we abstain from using the model’s decision when the probability of its answer (i.e., its “confidence”) falls below a certain threshold [11, 12, 13]. This requires a good calibration of the model. However, even if LMs were well calibrated, and they are generally not [10, 14], the rule would still require actually running the inference. For large LMs this comes at a non-negligible cost per token, either in required infrastructure or through the price of the API. To avoid being wasteful, we explore how much we can anticipate the level of success for a particular instance (or collection of instances), without running it through the LM at all.

Figure 1: (Top) Process of an LM generating the solution for a (repetitive) date transformation problem in a spreadsheet. Once the user prompts one instance of the desired transformation (row 1), the LM proceeds to transform the rest of the instances (rows 2 onward). (Bottom) Process of an assessor that can reliably predict beforehand the performance of the LM at the instance level. [Figure content: a spreadsheet with columns "Date & Time Stored as Text" and "Date & Time (CEST)".]

For this, we need an assessor: an external conditional probability (or density) estimator that can reliably predict beforehand the performance of an LM at instance granularity [15]. With a good assessor, we could make the calculation of whether it is actually worth asking the
LM for an answer, depending on factors such as the value of a correct result, the cost of running the model, and of course the performance estimated by the assessor. See Figure 1 for an illustrative example.

This paper describes a (successful) attempt at building such an assessor for a collection of large LMs consisting of various scales of GPT-3 [3] and BIG-G [10] models for application to a diverse set of data-wrangling tasks. Data wrangling [16, 17] is a notoriously time-consuming data preparation chore where LMs have recently shown promising results [18]. We discuss data wrangling and the considerations regarding the use of LMs in sections 2.1 and 2.2.

Contributions
To our knowledge, this is the first paper analysing assessors applied to the language domain, and to a plausible use case in general. Additionally,
• We find that lightweight assessors can give reliable instance-level predictions of the performance of large LMs.
• We find that their predictions are well-calibrated and unbiased, again comparable to the self-assessment of the LMs.
• We investigate the contributions of various features like #shots and #parameters to assessor performance.

2. Background

In this section we revisit some key ideas of LMs, their costs, their applications to the data wrangling problem, and the traditional (post-hoc) reject option. We also summarise the main elements of the recently introduced concept of assessor models.

2.1. (Large) Language Models

In less than a decade, research in Natural Language Processing (NLP) has been overturned by the appearance of a suite of LMs trained in an unsupervised manner on very large corpora. LMs are capturing more and more of the information in natural language, including the linguistic characteristics of various human languages and associated knowledge. Moreover, these models can be adapted (e.g., through fine-tuning) to a wide range of downstream tasks [8]. Recent LMs such as GPT-3 [3], PanGu-𝛼 [19], GLaM [20] and OPT [21] have excelled at few-shot inference, where a task is solved by supplying a small set of correct examples formatted as a prompt. The quality of the completion usually depends on the number of supplied examples. For instance, 5-shot inference is usually better than 2-shot inference, but requires more effort from the user.

However, on many occasions the cost of running LMs is not negligible, in both computational [22, 23] and economic terms [24]. Large LMs, open source or not, all have steep development costs in common. A recent study [24] puts the cost of developing an LM with only 1.5 billion parameters at $1.6 million. Inference cost is another drain: [25] estimates the cost of running GPT-3, if run in the cloud, at a minimum of $87,000 per year, with the current API price for Davinci being 6 cents per 750 words (1). Of course, these costs go down quickly as compute becomes cheaper, but larger models are expected to replace the old ones quickly to set the new state of the art. Also, as LMs increase their performance, their penetration rates will increase, becoming widespread in billions of semi-automated operations in many domains, and compute might easily become more of an issue, not less.

(1) https://openai.com/api/pricing/

2.2. Data Wrangling

Data wrangling [16, 17] is a data preparation task that data janitors, data scientists and other people working with forms, spreadsheets and other data formatting situations consider a very monotonous and laborious part of their jobs. Data wrangling can require as much as 80 percent of their time [26], including tediously transforming data presented in heterogeneous formats into a standardised format for efficient access, understanding, and analysis. One of the challenges in data-wrangling automation consists of selecting the correct (string) transformations from the vast set of possible ones, and doing so after having seen only a few examples [27]. Many approaches have attempted to address this challenge by reducing the transformation space through the incorporation of prior knowledge [28, 29]. This has led to many tools that use domain-specific languages or need ad-hoc solutions [30].

Because LMs capture vast amounts of human knowledge across many different domains, they can be especially effective for more open-ended tasks, and data wrangling is recognised as such a task in data science automation [31]. Using few-shot inference [3, 32, 33], LMs have shown promising yet unreliable results for data wrangling. For instance, in [18] GPT-3 Davinci (prompted, not fine-tuned) achieves 56% accuracy in the 1-shot setting, 68% in the 4-shot setting, and almost 90% with 10 shots. Additionally, as opposed to LM results on other tasks, GPT-3 is also relatively well calibrated in the data-wrangling task, reporting a Brier score of 0.11 (see section 4).
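
To make the Brier score quoted above concrete: it is the mean squared difference between the predicted probability of a correct answer and the binary outcome, so 0 is a perfect score and always predicting 0.5 scores 0.25. The following is a minimal sketch in Python, not the evaluation code used in [18]; the function name `brier_score` and the toy confidences and outcomes are illustrative assumptions.

```python
from typing import Sequence


def brier_score(confidences: Sequence[float], outcomes: Sequence[int]) -> float:
    """Mean squared error between predicted probabilities and binary outcomes."""
    if len(confidences) != len(outcomes):
        raise ValueError("confidences and outcomes must have the same length")
    if not confidences:
        raise ValueError("at least one prediction is required")
    squared_errors = [(p - y) ** 2 for p, y in zip(confidences, outcomes)]
    return sum(squared_errors) / len(squared_errors)


# Hypothetical toy data (not from the paper): the confidence assigned to each
# completion, and whether the completion matched the target exactly (1) or not (0).
toy_confidences = [0.9, 0.8, 0.3, 0.6]
toy_outcomes = [1, 1, 0, 0]

toy_score = brier_score(toy_confidences, toy_outcomes)
print(f"Brier score: {toy_score:.3f}")
```

On this toy data the squared errors are 0.01, 0.04, 0.09 and 0.36, giving a score of 0.125. The same function applies unchanged to probabilities produced by an external estimator instead of the LM itself.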
2.3. Reject option

Given these unreliable accuracies but good calibration scores, we could make more reliable and effective use of these systems by discarding the outputs for which the confidence of the system is low. In other words, if we know for which instances the LM is (likely to be) wrong, we can abstain from using the output of the LM in these cases. This is called a ‘reject option’, and a classic and straightforward implementation for it is to use a confidence threshold $t$ and compare it with the probability $p(\hat{y} \mid x)$ that a model $p$ assigns to its output $\hat{y}$. This represents the self-assigned probability of being correct (i.e., its confidence) [11, 12, 13]. We set $t$ to match the error tolerance of the use case, and when $p(\hat{y} \mid x) < t$, we do not use the output of the model, usually delegating to a human. However, this classical interpretation of the reject rule still requires running the model. As mentioned before, this can be expensive for large LMs. Whenever the reject rule triggers, it is not only that humans need to do the task manually, but we have also incurred a cost in the computation of a model that is effectively wasted.

2.4. Assessors

Assessor models [15] provide an external anticipative reject option instead. Assessors are conditional probability (or density) estimators $\hat{R}(r \mid \pi, \mu)$ that are trained on evaluation data. With ‘evaluation data’ we mean a set of evaluation records $\langle \pi, \mu, r \rangle$, where $\pi$ refers to a profile or description of a particular system (e.g., deployment conditions, state, system architecture, or hyperparameters), $\mu$ refers to a particular instance (e.g., a prompt), and $r$ to an empirical measurement of the performance of $\pi$ on $\mu$.

Assessor models are meant to act as general mappings between the space of systems, the space of instances, and the corresponding distribution of scores. They are a way of capturing all available evaluation information in a single predictive model that could be used, e.g., to investigate what features make an instance difficult, to add confidence capabilities to systems that do not have them, or to select the optimal model for a specific instance. In this case, we focus on their use to provide an anticipative reject option: when $\hat{R}$ is built and shown to be an accurate estimator, we can use it to make inferences on the expected performance $\hat{R}(r = 1 \mid \pi, \mu)$ given a system $\pi$ and instance $\mu$ (or a collection of those).

We do still have to run actual inference on the assessor, but as we show in the experiments, assessors can be multiple orders of magnitude smaller than the LMs, allowing us to cheaply avoid any LM inference that is doomed to fail.

3. Methods

In this section we describe the experimental setting, including the goals of the analysis, the data sources and how they are converted into evaluation records, and how we build the assessor from them.

3.1. Experimental Questions

We set three experimental questions:
Q1: Can we build lightweight yet good assessors for language models in this domain?
Q2: Are the assessors of comparable quality to the language models when estimating probabilities?
Q3: What features from the systems and the instances are most relevant for predicting success and consequently for building good assessors?

3.2. Data Sources and Train-Test Split

We work with the Data Wrangling Dataset Repository (2), containing 119 tasks from 7 domains (dates, emails, free text, names, phones, times, and units). In particular, we use results (at instance level) from multiple LMs obtained from two different evaluation efforts. First, [34] have produced granular results of the evaluation of different versions of GPT-3. We have 146k instances available for GPT-3 models Ada (350M), Babbage (1.3B), Curie (6.7B), and Davinci (175B), from 0-shot to 10-shot. More information about the architectures can be found in [3]. Second, [10] provides results on the same benchmark for a collection of Google LMs of various parameter sizes. Here we extract 86k instances, from 0-shot to 3-shot, for 22 models with parameter sizes ranging from 2M to 128B across two different model families, a decoder-only dense transformer (BIG-G dense) and a sparse Mixture-of-Experts [35] model (BIG-G sparse). More information on the BIG-G network architectures is available in [10]. All models (GPT-3 and BIG-G variants) were queried with temperature set to 0, and none of them were fine-tuned for the data-wrangling task.

(2) http://dmip.webs.upv.es/datawrangling/

As the assessor is trained on a somewhat heterogeneous collection of systems and instances, we have to be careful to define a train-test partition of the evaluation results without contamination or information leakage. To this end, we must ensure that the same instances are consistently used across systems and shots. For example, we have to avoid that the result of BIG-G dense with 2 shots on instance $i$ is in the training set, while GPT-3 Ada’s result with 0 shots on the same instance is in the test set. Figure 2 shows a visual representation of the partition requirements that ensure that this does not
happen. The order-matched train-test partition leads to 194k training instances and 38k testing instances.

Figure 2: Illustration of the matching requirements for making a train-test partition for the assessor. Each column represents a data wrangling prompt used to evaluate an LM. Orange columns represent instances included in the training set for the assessor, while green columns represent those included in the test set for the assessor. To avoid contamination, the same instances should be used across different shots and systems.

3.3. Anatomy of the Evaluation Record

From our two data sources, we receive records of the shape ⟨system id, #shots, prompt, score⟩. We further annotate this record with features describing the system ($\pi$), and extract metafeatures of the instance ($\mu$) that are fit for tabular representation (as opposed to free-form text). In the end, this creates a general record of the shape $\langle \pi, \mu, r \rangle$ = ⟨⟨system features⟩, ⟨instance features⟩, score⟩. We describe these features in detail below, but ultimately the only constraint for making a useful assessor is that all system and instance features are available without actually running the original model.

3.3.1. System features

The available system features include a system id that refers to a specific trained LM, i.e., a set of learned parameters fitting a certain architecture, the id of that architecture (either GPT-3, BIG-G sparse, or BIG-G dense), whether a model is dense or sparse, and the number of parameters. These features will of course be the same for all records of the same trained model.

3.3.2. Instance features

Instance features include the number of shots, the id of the prompt-template (3), and 54 simple binary metafeatures that can be automatically extracted through simple regular expressions from the original text. Examples include the kind of symbols the instance contains (e.g., numbers, dots, dashes) or whether it starts with a digit (see Figure 3 for an example). We refer to [29] for an overview. The binary metafeatures are available for every input and output in the prompt, so for example for a 2-shot prompt, we would have 2 inputs and 2 outputs from the examples, and 1 input for the actual question, totalling 5 ⋅ 54 = 270 features.

(3) The prompt-template differs between [10] and [34], but the same metafeatures can be extracted.

Figure 3: Example of metafeatures that can be extracted from the examples of different domains (dates and emails in the figure). Adapted from [29]. [Figure content: prompted data "24 - 07 - 22" (hasPunctuation, hasDigits, startWithDigit) and "22" (isNumeric); test data "ebem@ws.edu" (hasDot, hasAt, startLower).]

3.3.3. Score

For the data wrangling tasks, all scores are binary: 1 if the output of the LM matches the target string exactly, and 0 otherwise. The score is what the assessors must predict, and thus acts as a label during training.

3.4. Assessor Building and Evaluation

For the assessor model, we train a Random Forest [36] of 100 decision trees with a minimum node size of 5 that randomly selects 50% of the available variables in each split. These settings were tuned through grid search on a validation set, using an 84%-16% training-validation split (4) of the training set defined previously. For the remaining hyperparameters the defaults were used (5). We report the Area Under the Receiver Operating Characteristic Curve (AUROC) and the Brier Score (BS), as well as the decomposition of the latter into calibration and refinement loss [37, 38, 39].

As a baseline to compare assessors to, we take the standard approach of interpreting the probability $p(\hat{y} \mid x)$ the LM assigns to its output $\hat{y}$ as the “confidence” of the model, i.e., its self-assessed probability of being correct. However, no $p(\hat{y} \mid x)$ values are recorded in the BIG-bench logs, so we cannot compare the assessor AUROC or BS to those of the BIG-G family of models. For GPT-3 this information is available.

(4) The non-standard train-test partition is the result of the instance matching procedure described in section 3.2.
(5) The randomForest R package (https://cran.r-project.org/web/packages/randomForest/index.html) was used for training the assessor model.

4. Results and Discussion

Since it is assumed that all models give better results with $n + 1$ shots than with $n$ shots, Figure 4 shows the accuracies of the LMs, with the maximum number of shots
Table 1
AUROC and BS (Calibration, Refinement) in the GPT-3 data, for a single assessor trained with both GPT-3 and BIG-G data
using all available features (except prompt-template id), alongside the self-estimation from GPT-3 LMs. In the bottom row,
AUROC and BS are not averaged, but calculated from the aggregated set of instances. The average accuracies from GPT-3
LMs (with std. dev. across #shots) on the original data-wrangling task are also presented, and serve as an indication of the
class distribution the assessor has to deal with.

                                                                                                                 𝑅̂                               GPT-3 self-estimation
           id                                           LM Acc.
                                                                                          AUROC                 BS (CAL, REF)                 AUROC           BS (CAL, REF)
           GPT-3 Ada 350M                          0.524±0.232                             0.901               0.144 (0.033, 0.111)            0.908        0.122 (0.005, 0.117)
           GPT-3 Babbage 1.3B                      0.580±0.240                             0.914               0.141 (0.036, 0.106)            0.920        0.116 (0.004, 0.102)
           GPT-3 Curie 6.7B                        0.625±0.244                             0.918               0.130 (0.025, 0.105)            0.934        0.108 (0.011, 0.097)
           GPT-3 Davinci 175B                      0.689±0.253                             0.917               0.125 (0.022, 0.099)            0.944        0.096 (0.008, 0.087)
           Aggregated                              0.604±0.262                             0.916               0.135 (0.024, 0.111)            0.929        0.110 (0.005, 0.105)


used in BIG-G data (3-shots), the same number of shots for GPT-3 (for comparability), and the
maximum used (10-shots) on the original data-wrangling tasks. Despite the promising progress in
the state-of-the-art capabilities of LMs, they still struggle to master data-wrangling tasks with
very few shots. For 3-shot inference, BIG-G dense 128B achieves an accuracy of 0.776,
outperforming BIG-G sparse 8B and GPT-3 175B. When it comes to 10-shots (only GPT-3 was
available), GPT-3 175B achieves a promising accuracy of nearly 90%, outperforming other GPT-3
variants.

[Figure 4: plot of Accuracy (y-axis) against Parameters (M) (x-axis), with points labelled by model size for the series BIG-G dense (3-shot), BIG-G sparse (3-shot), GPT-3 (10-shot) and GPT-3 (3-shot).]
Figure 4: LMs’ accuracies per LM and size on the original data-wrangling task. Logarithmic scale used on the 𝑥-axis.

Table 2
AUROC and BS (Calibration, Refinement) for BIG-G data using a single assessor trained with both
GPT-3 and BIG-G data using all available features (except prompt-template id). In the bottom
row, AUROC and BS are not averaged, but calculated from the aggregated set of instances. The
average accuracies of the LM (with std. dev. across #shots) on the original data-wrangling task
are also presented, and serve as an indication of the class distribution the assessor has to deal
with.

                                                      𝑅̂
id                          LM Acc.        AUROC    BS (CAL, REF)
BIG-G sparse 2M             0.018±0.008    0.919    0.005 (0.002, 0.003)
BIG-G sparse 16M            0.054±0.015    0.636    0.022 (0.007, 0.015)
BIG-G sparse 53M            0.103±0.044    0.747    0.064 (0.020, 0.044)
BIG-G sparse 125M           0.250±0.144    0.833    0.121 (0.044, 0.077)
BIG-G sparse 244M           0.330±0.199    0.844    0.135 (0.051, 0.084)
BIG-G sparse 422M           0.376±0.232    0.840    0.151 (0.064, 0.084)
BIG-G sparse 1B             0.445±0.267    0.867    0.147 (0.061, 0.087)
BIG-G sparse 2B             0.479±0.296    0.885    0.139 (0.060, 0.079)
BIG-G sparse 4B             0.491±0.302    0.877    0.148 (0.060, 0.088)
BIG-G sparse 8B             0.533±0.325    0.866    0.155 (0.057, 0.098)
BIG-G dense 2M              0.013±0.001    0.919    0.011 (0.003, 0.008)
BIG-G dense 16M             0.051±0.012    0.781    0.020 (0.009, 0.010)
BIG-G dense 53M             0.118±0.056    0.741    0.073 (0.018, 0.055)
BIG-G dense 125M            0.207±0.117    0.809    0.117 (0.047, 0.070)
BIG-G dense 244M            0.291±0.179    0.836    0.133 (0.047, 0.086)
BIG-G dense 422M            0.331±0.203    0.784    0.165 (0.065, 0.100)
BIG-G dense 1B              0.407±0.258    0.853    0.154 (0.073, 0.081)
BIG-G dense 2B              0.447±0.276    0.834    0.173 (0.069, 0.104)
BIG-G dense 4B              0.479±0.292    0.875    0.147 (0.057, 0.090)
BIG-G dense 8B              0.493±0.304    0.869    0.151 (0.049, 0.102)
BIG-G dense 27B             0.516±0.321    0.883    0.144 (0.058, 0.086)
BIG-G dense 128B            0.574±0.353    0.857    0.164 (0.059, 0.104)
Aggregated (BIG-G sparse)   0.308±0.275    0.894    0.109 (0.012, 0.097)
Aggregated (BIG-G dense)    0.328±0.273    0.884    0.121 (0.015, 0.106)
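The AUROC and BS (CAL, REF) columns reported in the tables above can be illustrated with a small sketch. This is not the authors' evaluation code: it assumes the Brier score is decomposed by grouping instances that share the same predicted probability (in which case BS = CAL + REF holds exactly), whereas the published values may have been computed with a different binning. The function names and the toy data are hypothetical.

```python
from collections import defaultdict


def brier_decomposition(probs, outcomes):
    """Return (BS, CAL, REF) for predicted success probabilities and 0/1 outcomes.

    Assumption: instances are grouped by identical predicted probability,
    so that BS = CAL + REF holds exactly.
    """
    n = len(probs)
    # Mean squared difference between prediction and observed outcome.
    bs = sum((p - y) ** 2 for p, y in zip(probs, outcomes)) / n

    groups = defaultdict(list)
    for p, y in zip(probs, outcomes):
        groups[p].append(y)

    cal = 0.0  # calibration loss: gap between prediction and observed rate
    ref = 0.0  # refinement loss: uncertainty remaining within each group
    for p, ys in groups.items():
        rate = sum(ys) / len(ys)
        cal += len(ys) * (p - rate) ** 2
        ref += len(ys) * rate * (1.0 - rate)
    return bs, cal / n, ref / n


def auroc(probs, outcomes):
    """Probability that a random success is scored above a random failure.

    Ties count as one half. Quadratic in the number of instances, which is
    fine for a small illustration.
    """
    pos = [p for p, y in zip(probs, outcomes) if y == 1]
    neg = [p for p, y in zip(probs, outcomes) if y == 0]
    wins = sum((a > b) + 0.5 * (a == b) for a in pos for b in neg)
    return wins / (len(pos) * len(neg))


if __name__ == "__main__":
    # Toy assessor outputs (hypothetical) and whether the LM actually succeeded.
    toy_probs = [0.2, 0.2, 0.8, 0.8, 0.8]
    toy_outcomes = [0, 1, 1, 1, 0]
    bs, cal, ref = brier_decomposition(toy_probs, toy_outcomes)
    print(f"BS={bs:.3f} CAL={cal:.3f} REF={ref:.3f}")
    print(f"AUROC={auroc(toy_probs, toy_outcomes):.3f}")
```

Under this grouping, a low CAL means the predicted probabilities match the observed success rates, while a low REF means the predictions separate successes from failures well.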
   Table 1 describes the AUROC and BS, decomposed into calibration loss (CAL) and refinement loss
(REF), for the GPT-3 data given by an assessor trained with all features available except for the
prompt-template id, along with GPT-3’s self-assessment on the same set of instances. Table 2
describes the AUROC and BS, with the same decomposition, for the BIG-G data given by the
assessor. Finally, Table 3 shows the impact of various system and instance features on the
performance of the assessor.

   Analysing the results in Table 1 and Table 2, we see relatively good results overall in the
assessor’s performance, with AUROCs of around 0.9 and BSs around 0.12 (Q1). It should be noted
that the metrics for the smallest LMs have to be interpreted cautiously due to the significant
imbalance in the distribution of LM scores (i.e., for very low accuracies it is easy to predict that
the LM will usually fail). We also see that, in general, performance is worse for the BIG-G models
than for the GPT-3 ones. This could be due to GPT-3 being more predictable, or to the availability
of more data for GPT-3 (possibly making the assessor pay more attention to the majority model
family in its generalisation). This observation suggests that the distribution of results of each
system affects the performance of the assessor accordingly. If we wanted to focus on building an
assessor for a specific LM, techniques like instance weights or oversampling could have an effect.

   Comparing these results with GPT-3’s self-assessment in Table 1, we can conclude that the
assessor performs slightly worse than GPT-3, but is definitely comparable (Q2). A significant part
of the difference in BS comes from the calibration (CAL) term, and not from the refinement (REF)
term, which is very similar for the LMs and the assessor, especially for the smaller versions of
GPT-3. This suggests that post-hoc calibration methods [40] like isotonic regression could still
improve results significantly.

   In the feature importance study in Table 3, we can see that using either the system id or the
number of parameters improves performance significantly, likely because both can indicate the
scale of the system, which highly correlates with performance. The use of system id generalises
slightly worse than #parameters. Other features, like #shots, prompt-template id, or model family
and sparsity indicators have less effect on the performance (Q3). The assessor can easily derive
the #shots from the input (more examples result in more features being present), so this makes
sense. We did not measure any effect on aggregated performance from the different
prompt-templates, and it is likely this feature is simply non-informative. Regarding model family
and sparsity, we hypothesise that there is a large overlap between which instances the LMs solve
correctly, so model architecture is not indicative of major performance differences (or the
assessor fails to pick up on them).

Table 3
Ablation study of the impact of various features on assessor performance. The 54 instance
metafeatures are always included. Row 4, in italics, indicates the assessor we reported in
Table 1 and Table 2.
[Five feature columns with rotated headers. Labels recoverable from the source: system id, #parameters, template id, fam. & spars., #shots. The alignment of labels to columns is lost in the source.]

row                             AUROC (𝑅̂)   BS (CAL, REF) (𝑅̂)
1     •                         0.909       0.130 (0.015, 0.115)
2     •                   •     0.910       0.130 (0.015, 0.115)
3     •              •    •     0.912       0.127 (0.014, 0.113)
4     •    •         •    •     0.916       0.128 (0.017, 0.111)
5          •         •    •     0.917       0.126 (0.015, 0.111)
6          •              •     0.916       0.126 (0.015, 0.111)
7          •    •         •     0.916       0.128 (0.016, 0.112)
8               •         •     0.869       0.167 (0.030, 0.137)
9                    •    •     0.868       0.168 (0.031, 0.137)
10                        •     0.865       0.170 (0.033, 0.137)

   Finally, we discuss a concrete example using the assessor to implement a reject rule (see
Table 4). For the GPT-3 data in the test set (24604 instances), we take a reject threshold of 1%,
i.e., we reject instances where the assessor deems it is less than 1% likely the LM would succeed.
The assessor rejects about 5340 instances, which account for 21.7% of the instances and
(approximately) of the total compute. Of these 5340, 5114 are correctly rejected, representing 46%
of the failures, at the cost of only 226 correct answers being rejected (about 1.5%).

   Therefore, a lot of compute, money, and emissions would be saved since the assessor is far
smaller than the LMs in terms of parameters and inference time. Concretely, the proposed assessor
has 100 decision trees of (approximately) 20000 nodes, whose inference time is in the order of
100 ⋅ log2(20000) ≈ 1450 comparisons, much smaller than what LMs require for one pass through
their billions of parameters.

Table 4
Confusion matrix with reject threshold < 0.01 of assessor predictions for GPT-3. The 0 and 1
represent wrong and correct responses by the LM respectively.

                          Actual
                      failure    correct
Predicted  failure       5114        226
           correct       6004      13481

5. Conclusions and Future work

We have illustrated how a small assessor can manage performance expectations at a level that is
comparable to the self-assessment of giant language models with billions of parameters. We have
shown the assessor can be well calibrated and make refined predictions. We find that the assessor
picks up on system features like id or #parameters that explain large variances in performance.
We showcase how assessors can be used to reject instances before running much larger language
models, resulting in a significant saving of compute.

   There are of course some limitations to this work. For example, the instance metafeatures are
specific to the data-wrangling tasks used. Nonetheless, the positive results hint at future work.
There are still many ways of directly improving the assessor we have used here. For instance, we
could use post-hoc calibration with methods such as isotonic regression, or add instance weights
to the results of systems we especially care about. There are also many questions to further
investigate. Do assessors work for other tasks? Can we use a small LM instead
of a random forest to allow free-form input? What is the agreement between different systems, and
with the assessor?

   These future ideas could be useful from the perspective of saving computing costs as we
outlined before, but the schema is of wider applicability. There is a lot of useful information
generated during the evaluation process that is lost upon aggregation. Assessors are an attempt at
capturing this information and providing expectation management that is external, fine grained,
anticipative, and can make use of population data. We could use them as instance-level model
selectors, or we might be able to apply explainability techniques on the assessor to find out what
makes an instance difficult.

   There is definitely more to explore around the topic of assessors, which perform granular
assessments beyond generic aggregated results: saving compute by rejecting examples where the
original model is going to fail is an important illustrative application.

Acknowledgments

We thank the anonymous reviewers for their comments. This work has been partially supported by
the Norwegian Research Council grant 329745 Machine Teaching for Explainable AI, also by the EU
(FEDER) and Spanish MINECO grant RTI2018-094403-B-C32 funded by MCIN/AEI/10.13039/501100011033
and by “ERDF A way of making Europe”, Generalitat Valenciana under grant PROMETEO/2019/098, EU’s
Horizon 2020 research and innovation programme under grant agreement No. 952215 (TAILOR), US
DARPA HR00112120007 (RECoG-AI), and INNEST/2021/317 (Project cofunded by the European Union with
the “Programa Operativo del Fondo Europeo de Desarrollo Regional (FEDER) de la Comunitat
Valenciana 2014-2020”) and “the UPV (Vicerrectorado de Investigación) grant PAI-10-21”.

References

 [1] J. Devlin, M.-W. Chang, K. Lee, K. Toutanova, Bert: Pre-training of deep bidirectional transformers for language understanding, arXiv preprint arXiv:1810.04805 (2018).
 [2] C. Raffel, N. Shazeer, A. Roberts, K. Lee, S. Narang, M. Matena, Y. Zhou, W. Li, P. J. Liu, Exploring the limits of transfer learning with a unified text-to-text transformer, arXiv preprint arXiv:1910.10683 (2019).
 [3] T. B. Brown, B. Mann, N. Ryder, M. Subbiah, J. Kaplan, P. Dhariwal, A. Neelakantan, P. Shyam, G. Sastry, A. Askell, et al., Language models are few-shot learners, in: Advances in Neural Information Processing Systems, volume 33, 2020, pp. 1877–1901.
 [4] E. Kharitonov, A. Lee, A. Polyak, Y. Adi, J. Copet, K. Lakhotia, T.-A. Nguyen, M. Rivière, A. Mohamed, E. Dupoux, W.-N. Hsu, Text-Free Prosody-Aware Generative Spoken Language Modeling, arXiv:2109.03264 [cs, eess] (2021). arXiv:2109.03264.
 [5] J. W. Rae, S. Borgeaud, T. Cai, K. Millican, J. Hoffmann, F. Song, J. Aslanides, S. Henderson, R. Ring, S. Young, et al., Scaling language models: Methods, analysis & insights from training Gopher, arXiv preprint arXiv:2112.11446 (2021).
 [6] D. Hendrycks, C. Burns, S. Basart, A. Zou, M. Mazeika, D. Song, J. Steinhardt, Measuring massive multitask language understanding, arXiv preprint arXiv:2009.03300 (2020).
 [7] D. Hendrycks, S. Basart, S. Kadavath, M. Mazeika, A. Arora, E. Guo, C. Burns, S. Puranik, H. He, D. Song, J. Steinhardt, Measuring coding challenge competence with APPS, 2021. arXiv:2105.09938.
 [8] R. Bommasani, et al., On the opportunities and risks of foundation models, arXiv preprint arXiv:2108.07258, 2021.
 [9] L. Ouyang, J. Wu, X. Jiang, D. Almeida, C. L. Wainwright, P. Mishkin, C. Zhang, S. Agarwal, K. Slama, A. Ray, et al., Training language models to follow instructions with human feedback, arXiv preprint arXiv:2203.02155 (2022).
[10] A. Srivastava, A. Rastogi, et al., Beyond the imitation game: Quantifying and extrapolating the capabilities of language models, 2022. URL: https://arxiv.org/abs/2206.04615. doi:10.48550/ARXIV.2206.04615.
[11] R. Herbei, M. H. Wegkamp, Classification with reject option, The Canadian Journal of Statistics/La Revue Canadienne de Statistique (2006) 709–721.
[12] F. Tortorella, An optimal reject rule for binary classifiers, in: Joint IAPR International Workshops on Statistical Techniques in Pattern Recognition (SPR) and Structural and Syntactic Pattern Recognition (SSPR), Springer, 2000, pp. 611–620.
[13] K. Hendrickx, L. Perini, D. Van der Plas, W. Meert, J. Davis, Machine learning with a reject option: A survey, arXiv preprint arXiv:2107.11277 (2021).
[14] Z. Jiang, J. Araki, H. Ding, G. Neubig, How Can We Know When Language Models Know? On the Calibration of Language Models for Question Answering, Transactions of the Association for Computational Linguistics 9 (2021) 962–977. doi:10.1162/tacl_a_00407.
[15] J. Hernández-Orallo, W. Schellaert, F. Martínez-Plumed, Training on the test set: Mapping the system-problem space in AI, Proceedings of the AAAI Conference on Artificial Intelligence (2022).
[16] S. Kandel, J. Heer, C. Plaisant, J. Kennedy, F. Van Ham, N. H. Riche, C. Weaver, B. Lee, D. Brodbeck, P. Buono, Research directions in data wrangling: Visualizations and transformations for usable and credible data, Information Visualization 10 (2011) 271–288.
[17] T. Furche, G. Gottlob, L. Libkin, G. Orsi, N. Paton, Data wrangling for big data: Challenges and opportunities, in: Advances in Database Technology - EDBT 2016: Proceedings of the 19th International Conference on Extending Database Technology, 2016, pp. 473–478.
[18] G. Jaimovitch López, Comparison between machine learning and human learning from examples generated with machine teaching, 2020.
[19] W. Zeng, X. Ren, T. Su, H. Wang, Y. Liao, Z. Wang, X. Jiang, Z. Yang, K. Wang, X. Zhang, et al., Pangu-𝛼: Large-scale autoregressive pretrained Chinese language models with auto-parallel computation, arXiv preprint arXiv:2104.12369 (2021).
[20] N. Du, Y. Huang, A. M. Dai, S. Tong, D. Lepikhin, Y. Xu, M. Krikun, Y. Zhou, A. W. Yu, O. Firat, B. Zoph, L. Fedus, M. Bosma, Z. Zhou, T. Wang, Y. E. Wang, K. Webster, M. Pellat, K. Robinson, K. Meier-Hellstern, T. Duke, L. Dixon, K. Zhang, Q. V. Le, Y. Wu, Z. Chen, C. Cui, GLaM: Efficient Scaling of Language Models with Mixture-of-Experts, arXiv:2112.06905 [cs] (2021). arXiv:2112.06905.
[21] S. Zhang, S. Roller, N. Goyal, M. Artetxe, M. Chen, S. Chen, C. Dewan, M. Diab, X. Li, X. V. Lin, T. Mihaylov, M. Ott, S. Shleifer, K. Shuster, D. Simig, P. S. Koura, A. Sridhar, T. Wang, L. Zettlemoyer, OPT: Open Pre-trained Transformer Language Models, arXiv:2205.01068 [cs] (2022). arXiv:2205.01068.
[22] J. Kaplan, S. McCandlish, T. Henighan, T. B. Brown, B. Chess, R. Child, S. Gray, A. Radford, J. Wu, D. Amodei, Scaling laws for neural language models, CoRR abs/2001.08361 (2020). URL: https://arxiv.org/abs/2001.08361. arXiv:2001.08361.
[23] R. Desislavov, F. Martínez-Plumed, J. Hernández-Orallo, Compute and energy consumption trends in deep learning inference, arXiv preprint arXiv:2109.05472 (2021).
[24] O. Sharir, B. Peleg, Y. Shoham, The cost of training NLP models: A concise overview, arXiv preprint arXiv:2004.08900 (2020).
[25] B. Dickson, The GPT-3 economy, https://bdtechtalks.com/2020/09/21/gpt-3-economy-business-model/, 2020.
[26] D. Steinberg, How much time needs to be spent preparing data for analysis?, http://info.salford-systems.com/blog/bid/299181/How-Much-Time-Needs-to-be-Spent-Preparing-Data//-for-Analysis (2013).
[27] A. Bogatu, N. W. Paton, A. A. Fernandes, Towards automatic data format transformations: Data wrangling at scale, in: British International Conference on Databases, Springer, 2017, pp. 36–48.
[28] S. Gulwani, J. Hernández-Orallo, E. Kitzelmann, S. H. Muggleton, U. Schmid, B. Zorn, Inductive programming meets the real world, Communications of the ACM 58 (2015) 90–99.
[29] L. Contreras-Ochando, C. Ferri, J. Hernández-Orallo, F. Martínez-Plumed, M. J. Ramírez-Quintana, S. Katayama, Automated data transformation with inductive programming and dynamic background knowledge, in: Joint European Conference on Machine Learning and Knowledge Discovery in Databases, Springer, 2019, pp. 735–751.
[30] S. Kandel, A. Paepcke, J. Hellerstein, J. Heer, Wrangler: Interactive visual specification of data transformation scripts, in: Proceedings of the SIGCHI Conference on Human Factors in Computing Systems, 2011, pp. 3363–3372.
[31] T. De Bie, L. De Raedt, J. Hernández-Orallo, H. H. Hoos, P. Smyth, C. K. I. Williams, Automating data science, Communications of the ACM 65 (2022) 76–87. doi:10.1145/3495256.
[32] R. Puri, B. Catanzaro, Zero-shot text classification with generative language models, arXiv preprint arXiv:1912.10165 (2019).
[33] T. Schick, H. Schütze, Exploiting cloze questions for few shot text classification and natural language inference, arXiv preprint arXiv:2001.07676 (2020).
[34] G. Jaimovitch-Lopez, C. Ferri, J. Hernández-Orallo, F. Martínez-Plumed, M. J. Ramírez-Quintana, Can language models automate data wrangling?, in: Workshop on Automating Datascience at ECML-PKDD, 2021, p. 13.
[35] B. Zoph, I. Bello, S. Kumar, N. Du, Y. Huang, J. Dean, N. Shazeer, W. Fedus, Designing effective sparse expert models, arXiv preprint arXiv:2202.08906 (2022).
[36] L. Breiman, Random forests, Machine learning 45 (2001) 5–32.
[37] A. Murphy, A New Vector Partition of the Probability Score, Journal of Applied Meteorology 12 (1973) 595–600.
[38] P. Flach, E. Matsubara, On classification, ranking, and probability estimation, in: Dagstuhl Seminar Proceedings, Schloss Dagstuhl-Leibniz-Zentrum für Informatik, 2008.
[39] J. Hernández-Orallo, P. Flach, C. Ferri Ramírez, A unified view of performance metrics: Translating threshold choice into expected classification loss, Journal of Machine Learning Research 13 (2012) 2813–2869.
[40] A. Bella, C. Ferri, J. Hernández-Orallo, M. J. Ramírez-Quintana, Calibration of machine learning models, in: Handbook of Research on Machine Learning Applications and Trends: Algorithms, Methods, and Techniques, IGI Global, 2010, pp. 128–146.