Reject Before You Run: Small Assessors Anticipate Big Language Models

Lexin Zhou¹, Fernando Martínez-Plumed¹, José Hernández-Orallo¹,², Cèsar Ferri¹ and Wout Schellaert¹

¹ Valencian Research Institute for Artificial Intelligence (VRAIN), Universitat Politècnica de València
² Leverhulme Centre for the Future of Intelligence, University of Cambridge

EBeM'22: Workshop on AI Evaluation Beyond Metrics, July 25, 2022, Vienna, Austria
Email: lzhou@inf.upv.es (L. Zhou); fermarpl@dsic.upv.es (F. Martínez-Plumed); jorallo@dsic.upv.es (J. Hernández-Orallo); cferri@dsic.upv.es (C. Ferri); wschell@vrain.upv.es (W. Schellaert)
Web: https://lexzhou.github.io/ (L. Zhou); https://nandomp.github.io/ (F. Martínez-Plumed); http://josephorallo.webs.upv.es/ (J. Hernández-Orallo); http://personales.upv.es/ceferra/ (C. Ferri); https://schellaert.org/ (W. Schellaert)
ORCID: 0000-0003-1161-4270 (L. Zhou); 0000-0003-2902-6477 (F. Martínez-Plumed); 0000-0001-9746-7632 (J. Hernández-Orallo); 0000-0002-8975-1120 (C. Ferri); 0000-0002-9182-4747 (W. Schellaert)
© 2022 Copyright for this paper by its authors. Use permitted under Creative Commons License Attribution 4.0 International (CC BY 4.0). CEUR Workshop Proceedings (CEUR-WS.org), http://ceur-ws.org, ISSN 1613-0073

Abstract

Large Language Models (LMs) are expensive to operate. It would be more frugal to avoid querying them when results are predictably bad. In this paper we therefore investigate whether it is possible to granularly predict the performance of these large LMs with a much smaller external model, the assessor, which is trained on evaluation results. For instance, given an input prompt, can an assessor estimate the probability of correct completion by a giant like GPT-3 Davinci (175B parameters)? Using a data-wrangling task included in the BIG-bench repository as a case study, we find it is indeed possible, and we report results that are comparable in accuracy and calibration to the LM itself. This suggests that, at least for some tasks, a lot of compute, money, and emissions could be spared through the assessor's anticipative reject option. It also suggests that assessors can capture meaningful extra information from the evaluation procedure, and as such, could be a useful complement to simple aggregate metrics.

Keywords

Assessor, Anticipative Reject Option, Language Model, Data Wrangling, AI Evaluation, Instance Granularity

1. Introduction

Extensive experimental research on Language Models (LMs) keeps showing remarkable results across several domains including mathematics, question answering, language understanding, and code generation [1, 2, 3, 4, 5, 6, 7, 8, 9]. While the performance results for many tasks are, on average, quickly improving, there is a high variance in the results depending on the particular task, the instances, and the prompts [10]. For a given task, one can partially deal with the variability across instances through a traditional reject rule, where we abstain from using the model's decision when the probability of its answer (i.e., its "confidence") falls below a certain threshold [11, 12, 13]. This requires a good calibration of the model. However, even if LMs were well calibrated, and they are generally not [10, 14], the rule would still require actually running the inference. For large LMs this comes at a non-negligible cost per token, either in required infrastructure or through the price of the API. To avoid being wasteful, we explore how much we can anticipate the level of success for a particular instance (or collection of instances), without running it through the LM at all. For this, we need an assessor: an external conditional probability (or density) estimator that can reliably predict beforehand the performance of an LM at instance granularity [15]. With a good assessor, we could calculate whether it is actually worth asking the LM for an answer, depending on factors such as the value of a correct result, the cost of running the model, and of course the performance estimated by the assessor. See Figure 1 for an illustrative example.

Figure 1: (Top) Process of an LM generating the solution for a (repetitive) date transformation problem in a spreadsheet. Once the user prompts one instance of the desired transformation (row 1), the LM proceeds to transform the rest of the instances (rows 2 and onward). (Bottom) Process of an assessor that can reliably predict beforehand the performance of the LM at the instance level.

This paper describes a (successful) attempt at building such an assessor for a collection of large LMs consisting of various scales of GPT-3 [3] and BIG-G [10] models, for application to a diverse set of data-wrangling tasks. Data wrangling [16, 17] is a notoriously time-consuming data preparation chore where LMs have recently shown promising results [18]. We discuss data wrangling and the considerations regarding the use of LMs in Sections 2.1 and 2.2.

Contributions

To our knowledge, this is the first paper analysing assessors applied to the language domain, and to a plausible use case in general. Additionally:

• We find that lightweight assessors can give reliable instance-level predictions of the performance of large LMs.
• We find that their predictions are well calibrated and unbiased, again comparable to the self-assessment of the LMs.
• We investigate the contributions of various features like #shots and #parameters to assessor performance.

2. Background

In this section we revisit some key ideas of LMs, their costs, their applications to the data wrangling problem, and the traditional (post-hoc) reject option. We also summarise the main elements of the recently introduced concept of assessor models.

2.1. (Large) Language Models

In less than a decade, research in Natural Language Processing (NLP) has been overturned by the appearance of a suite of LMs trained in an unsupervised manner on very large corpora. LMs are capturing more and more of the information in natural language, including the linguistic characteristics of various human languages and associated knowledge. Moreover, these models can be adapted (e.g., through fine-tuning) to a wide range of downstream tasks [8]. Recent LMs such as GPT-3 [3], PanGu-α [19], GLaM [20] and OPT [21] have excelled at few-shot inference, where a task is solved by supplying a small set of correct examples formatted as a prompt. The quality of the completion usually depends on the number of supplied examples. For instance, 5-shot inference is usually better than 2-shot inference, but requires more effort from the user.

However, the cost of running LMs is often far from negligible in both computational [22, 23] and economic terms [24]. Large LMs, open source or not, all have steep development costs in common. A recent study [24] puts the cost of developing an LM with only 1.5 billion parameters at USD 1.6 million. Inference costs are another drain. [25] estimates the cost of running GPT-3, if run in the cloud, at a minimum of USD 87,000 per year, with the current API price for Davinci being 6 cents per 750 words¹. Of course, these costs go down quickly as compute becomes cheaper, but larger models are expected to replace the old ones quickly to set the new state of the art. Also, as LMs increase their performance, their penetration rates will increase, becoming widespread in billions of semi-automated operations in many domains, and compute might easily become more of an issue, not less.

¹ https://openai.com/api/pricing/

2.2. Data Wrangling

Data wrangling [16, 17] is a data preparation task that data janitors, data scientists and other people working with forms, spreadsheets and other data formatting situations consider a very monotonous and laborious part of their jobs. Data wrangling can require as much as 80 percent of their time [26], including tediously transforming data presented in heterogeneous formats into a standardised format for efficient access, understanding, and analysis. One of the challenges in data-wrangling automation consists of selecting the correct (string) transformations from the vast set of possible ones, and doing so after having seen only a few examples [27]. Many approaches have attempted to address this challenge by reducing the transformation space through the incorporation of prior knowledge [28, 29]. This has led to many tools that use domain-specific languages or need ad hoc solutions [30].

Because LMs capture vast amounts of human knowledge across many different domains, they can be especially effective for more open-ended tasks, and data wrangling is recognised as one such task in data science automation [31]. Using few-shot inference [3, 32, 33], LMs have shown promising yet unreliable results for data wrangling. For instance, in [18] GPT-3 Davinci (prompted, not fine-tuned) achieves 56% accuracy in the 1-shot setting, 68% in the 4-shot setting, and almost 90% with 10 shots. Additionally, as opposed to LM results on other tasks, GPT-3 is also relatively well calibrated in the data-wrangling task, reporting a Brier score of 0.11 (see Section 4).

2.3. Reject option

Given these unreliable accuracies but good calibration scores, we could use these systems more reliably and effectively by discarding the outputs for which the confidence of the system is low. In other words, if we know for which instances the LM is (likely to be) wrong, we can abstain from using the output of the LM in these cases. This is called a 'reject option', and a classic and straightforward implementation of it is to use a confidence threshold $t$ and compare it with the probability $p(\hat{y} \mid x)$ that a model $p$ assigns to its output $\hat{y}$. This represents the self-assigned probability of being correct (i.e., its confidence) [11, 12, 13]. We set $t$ to match the error tolerance of the use case, and when $p(\hat{y} \mid x) < t$, we do not use the output of the model, usually delegating to a human. However, this classical interpretation of the reject rule still requires running the model. As mentioned before, this can be expensive for large LMs. Whenever the reject rule triggers, not only do humans need to do the task manually, but we have also incurred the cost of a model computation that is effectively wasted.

2.4. Assessors

Assessor models [15] provide an external anticipative reject option instead. Assessors are conditional probability (or density) estimators $\hat{R}(r \mid \pi, \mu)$ that are trained on evaluation data. By 'evaluation data' we mean a set of evaluation records $\langle \pi, \mu, r \rangle$, where $\pi$ refers to a profile or description of a particular system (e.g., deployment conditions, state, system architecture, or hyperparameters), $\mu$ refers to a particular instance (e.g., a prompt), and $r$ to an empirical measurement of the performance of $\pi$ on $\mu$. Assessor models are meant to act as general mappings between the space of systems, the space of instances, and the corresponding distribution of scores. They are a way of capturing all available evaluation information in a single predictive model that could be used, e.g., to investigate what features make an instance difficult, to add confidence capabilities to systems that do not have them, or to select the optimal model for a specific instance. In this case, we focus on their use to provide an anticipative reject option: when $\hat{R}$ is built and shown to be an accurate estimator, we can use it to make inferences on the expected performance $\hat{R}(r = 1 \mid \pi, \mu)$ given a system $\pi$ and instance $\mu$ (or a collection of those).

We do still have to run actual inference on the assessor, but as we show in the experiments, assessors can be multiple orders of magnitude smaller than the LMs, allowing us to cheaply avoid any LM inference that is doomed to fail.

3. Methods

In this section we describe the experimental setting, including the goals of the analysis, the data sources and how they are converted into evaluation records, and how we build the assessor from them.

3.1. Experimental Questions

We set three experimental questions:

Q1: Can we build lightweight yet good assessors for language models in this domain?
Q2: Are the assessors of comparable quality to the language models when estimating probabilities?
Q3: What features from the systems and the instances are most relevant for predicting success and consequently for building good assessors?

3.2. Data Sources and Train-Test Split

We work with the Data Wrangling Dataset Repository², containing 119 tasks from 7 domains (dates, emails, free text, names, phones, times, and units). In particular, we use results (at instance level) from multiple LMs obtained from two different evaluation efforts. First, [34] have produced granular results of the evaluation of different versions of GPT-3. We have 146k instances available for GPT-3 models Ada (350M), Babbage (1.3B), Curie (6.7B), and Davinci (175B), from 0-shot to 10-shot. More information about the architectures can be found in [3]. Second, [10] provides results on the same benchmark for a collection of Google LMs of various parameter sizes. Here we extract 86k instances, from 0-shot to 3-shot, for 22 models with parameter sizes ranging from 2M to 128B across two different model families, a decoder-only dense transformer (BIG-G dense) and a sparse Mixture-of-Experts [35] model (BIG-G sparse). More information on the BIG-G network architectures is available in [10]. All models (GPT-3 and BIG-G variants) were queried with temperature set to 0, and none of them were fine-tuned for the data-wrangling task.

² http://dmip.webs.upv.es/datawrangling/

As the assessor is trained on a somewhat heterogeneous collection of systems and instances, we have to be careful to define a train-test partition of the evaluation results without contamination or information leakage. To this end, we must ensure that the same instances are consistently used across systems and shots. For example, we must avoid the situation where the result of BIG-G dense with 2 shots on instance $i$ is in the training set, while GPT-3 Ada's result with 0 shots on the same instance is in the test set. Figure 2 shows a visual representation of the partition requirements that ensure that this does not happen. The order-matched train-test partition leads to 194k training instances and 38k testing instances.

Figure 2: Illustration of the matching requirements for making a train-test partition for the assessor. Each column represents a data wrangling prompt used to evaluate an LM. Orange columns represent instances included in the training set for the assessor, while green columns represent those included in the test set for the assessor. To avoid contamination, the same instances should be used across different shots and systems. (Panel labels: Prompted data, Test data.)

3.3. Anatomy of the Evaluation Record

From our two data sources, we receive records of the shape $\langle$system id, #shots, prompt, score$\rangle$. We further annotate this record with features describing the system ($\pi$), and extract metafeatures of the instance ($\mu$) that are fit for tabular representation (as opposed to free-form text). In the end, this creates a general record of the shape $\langle \pi, \mu, r \rangle = \langle \langle$system features$\rangle, \langle$instance features$\rangle$, score$\rangle$. We describe these features in detail below, but ultimately the only constraint for making a useful assessor is that all system and instance features are available without actually running the original model.

3.3.1. System features

The available system features include a system id that refers to a specific trained LM, i.e., a set of learned parameters fitting a certain architecture, the id of that architecture (either GPT-3, BIG-G sparse, or BIG-G dense), whether a model is dense or sparse, and the number of parameters. These features will of course be the same for all records of the same trained model.

3.3.2. Instance features

Instance features include the number of shots, the id of the prompt-template³, and 54 simple binary metafeatures that can be automatically extracted through simple regular expressions from the original text. Examples include the kind of symbols the instance contains (e.g., numbers, dots, dashes) or whether it starts with a digit (see Figure 3 for an example). We refer to [29] for an overview. The binary metafeatures are available for every input and output in the prompt, so, for example, for a 2-shot prompt we would have 2 inputs and 2 outputs from the examples, and 1 input for the actual question, totalling $5 \cdot 54 = 270$ features.

³ The prompt-template differs between [10] and [34], but the same metafeatures can be extracted.

Figure 3: Example of metafeatures that can be extracted from the examples of different domains (dates and emails in the figure). Adapted from [29]. (Example strings: "24-07-22", "ebem@ws.edu"; metafeatures shown: hasPunctuation, isNumeric, hasDot, hasDigits, hasAt, startWithDigit, startLower.)

3.3.3. Score

For the data wrangling tasks, all scores are binary: 1 if the output of the LM matches the target string exactly, and 0 otherwise. The score is what the assessors must predict, and thus acts as a label during training.

3.4. Assessor Building and Evaluation

For the assessor model, we train a Random Forest [36] of 100 decision trees with a minimum node size of 5, selecting at random 50% of the available variables at each split. These hyperparameters were tuned through grid search on a validation set, using an 84%-16% training-validation split⁴ of the training set defined previously. For the remaining hyperparameters the defaults were used⁵. We report the Area Under the Receiver Operating Characteristic Curve (AUROC) and the Brier Score (BS), as well as the decomposition of the latter into calibration and refinement loss [37, 38, 39].

⁴ The non-standard train-test partition is the result of the instance matching procedure described in Section 3.2.
⁵ The RandomForest package (https://cran.r-project.org/web/packages/randomForest/index.html) was used for training the assessor model.

As a baseline to compare assessors to, we take the standard approach of interpreting the probability $p(\hat{y} \mid x)$ the LM assigns to its output $\hat{y}$ as the "confidence" of the model, i.e., its self-assessed probability of being correct. However, no values of $p(\hat{y} \mid x)$ are recorded in the BIG-bench logs, so we cannot compare the assessor AUROC or BS to those of the BIG-G family of models. For GPT-3 this information is available.

4. Results and Discussion

Since it is assumed that all models give better results with $n + 1$ shots than with $n$ shots, Figure 4 shows the accuracies of the LMs on the original data-wrangling tasks with the maximum number of shots used in the BIG-G data (3 shots), the same number of shots for GPT-3 (for comparability), and the maximum used for GPT-3 (10 shots). Despite the promising progress in the state-of-the-art capabilities of LMs, they still struggle to master data-wrangling tasks with very few shots. For 3-shot inference, BIG-G dense 128B achieves an accuracy of 0.776, outperforming BIG-G sparse 8B and GPT-3 175B. When it comes to 10 shots (only GPT-3 was available), GPT-3 175B achieves a promising accuracy of nearly 90%, outperforming the other GPT-3 variants.

Figure 4: LMs' accuracies per LM and size on the original data-wrangling task. Logarithmic scale used on the $x$-axis. (x-axis: Parameters (M); y-axis: Accuracy; series: BIG-G dense (3-shot), BIG-G sparse (3-shot), GPT-3 (3-shot), GPT-3 (10-shot).)

Table 1: AUROC and BS (Calibration, Refinement) in the GPT-3 data, for a single assessor trained with both GPT-3 and BIG-G data using all available features (except prompt-template id), alongside the self-estimation from GPT-3 LMs. In the bottom row, AUROC and BS are not averaged, but calculated from the aggregated set of instances. The average accuracies from GPT-3 LMs (with std. dev. across #shots) on the original data-wrangling task are also presented, and serve as an indication of the class distribution the assessor has to deal with.

                                     Assessor $\hat{R}$               GPT-3 self-estimation
id LM                 Acc.           AUROC   BS (CAL, REF)            AUROC   BS (CAL, REF)
GPT-3 Ada 350M        0.524±0.232    0.901   0.144 (0.033, 0.111)     0.908   0.122 (0.005, 0.117)
GPT-3 Babbage 1.3B    0.580±0.240    0.914   0.141 (0.036, 0.106)     0.920   0.116 (0.004, 0.102)
GPT-3 Curie 6.7B      0.625±0.244    0.918   0.130 (0.025, 0.105)     0.934   0.108 (0.011, 0.097)
GPT-3 Davinci 175B    0.689±0.253    0.917   0.125 (0.022, 0.099)     0.944   0.096 (0.008, 0.087)
Aggregated            0.604±0.262    0.916   0.135 (0.024, 0.111)     0.929   0.110 (0.005, 0.105)

Table 2: AUROC and BS (Calibration, Refinement) for BIG-G data using a single assessor trained with both GPT-3 and BIG-G data using all available features (except prompt-template id). In the bottom rows, AUROC and BS are not averaged, but calculated from the aggregated set of instances. The average accuracies of the LM (with std. dev. across #shots) on the original data-wrangling task are also presented, and serve as an indication of the class distribution the assessor has to deal with.

                                            Assessor $\hat{R}$
id LM                        Acc.           AUROC   BS (CAL, REF)
BIG-G sparse 2M              0.018±0.008    0.919   0.005 (0.002, 0.003)
BIG-G sparse 16M             0.054±0.015    0.636   0.022 (0.007, 0.015)
BIG-G sparse 53M             0.103±0.044    0.747   0.064 (0.020, 0.044)
BIG-G sparse 125M            0.250±0.144    0.833   0.121 (0.044, 0.077)
BIG-G sparse 244M            0.330±0.199    0.844   0.135 (0.051, 0.084)
BIG-G sparse 422M            0.376±0.232    0.840   0.151 (0.064, 0.084)
BIG-G sparse 1B              0.445±0.267    0.867   0.147 (0.061, 0.087)
BIG-G sparse 2B              0.479±0.296    0.885   0.139 (0.060, 0.079)
BIG-G sparse 4B              0.491±0.302    0.877   0.148 (0.060, 0.088)
BIG-G sparse 8B              0.533±0.325    0.866   0.155 (0.057, 0.098)
BIG-G dense 2M               0.013±0.001    0.919   0.011 (0.003, 0.008)
BIG-G dense 16M              0.051±0.012    0.781   0.020 (0.009, 0.010)
BIG-G dense 53M              0.118±0.056    0.741   0.073 (0.018, 0.055)
BIG-G dense 125M             0.207±0.117    0.809   0.117 (0.047, 0.070)
BIG-G dense 244M             0.291±0.179    0.836   0.133 (0.047, 0.086)
BIG-G dense 422M             0.331±0.203    0.784   0.165 (0.065, 0.100)
BIG-G dense 1B               0.407±0.258    0.853   0.154 (0.073, 0.081)
BIG-G dense 2B               0.447±0.276    0.834   0.173 (0.069, 0.104)
BIG-G dense 4B               0.479±0.292    0.875   0.147 (0.057, 0.090)
BIG-G dense 8B               0.493±0.304    0.869   0.151 (0.049, 0.102)
BIG-G dense 27B              0.516±0.321    0.883   0.144 (0.058, 0.086)
BIG-G dense 128B             0.574±0.353    0.857   0.164 (0.059, 0.104)
Aggregated (BIG-G sparse)    0.308±0.275    0.894   0.109 (0.012, 0.097)
Aggregated (BIG-G dense)     0.328±0.273    0.884   0.121 (0.015, 0.106)

Table 1 describes the AUROC and BS (decomposed into calibration loss, CAL, and refinement loss, REF) for the GPT-3 data given by an assessor trained with all features available except for the prompt-template id, along with GPT-3's self-assessment on the same set of instances. Table 2 describes the AUROC and BS (with the same decomposition) for the BIG-G data given by the assessor. Finally, Table 3 shows the impact of various system and instance features on the performance of the assessor.

Table 3: Ablation study of the impact of various features on assessor performance. The 54 instance metafeatures are always included. Row 4, in italics in the original, indicates the assessor we reported in Table 1 and Table 2. Feature columns: system id, # parameters, family and sparsity, # shots, prompt-template id.

Row   Included features (one • per feature)   AUROC ($\hat{R}$)   BS (CAL, REF) ($\hat{R}$)
1     •                                       0.909               0.130 (0.015, 0.115)
2     • •                                     0.910               0.130 (0.015, 0.115)
3     • • •                                   0.912               0.127 (0.014, 0.113)
4     • • • •                                 0.916               0.128 (0.017, 0.111)
5     • • •                                   0.917               0.126 (0.015, 0.111)
6     • •                                     0.916               0.126 (0.015, 0.111)
7     • • •                                   0.916               0.128 (0.016, 0.112)
8     • •                                     0.869               0.167 (0.030, 0.137)
9     • •                                     0.868               0.168 (0.031, 0.137)
10    •                                       0.865               0.170 (0.033, 0.137)

Analysing the results in Table 1 and Table 2, we see relatively good results overall in the assessor's performance, with AUROCs of around 0.9 and BSs around 0.12 (Q1). It should be noted that the metrics for the smallest LMs have to be interpreted cautiously due to the significant imbalance in the distribution of LM scores (i.e., for very low accuracies it is easy to predict that the LM will usually fail). We also see that, in general, performance is worse for the BIG-G models than for the GPT-3 ones. This could be due to GPT-3 being more predictable, or to the availability of more data for GPT-3 (possibly making the assessor pay more attention to the majority model family in its generalisation). This observation suggests that the distribution of results of each system affects the performance of the assessor accordingly. If we would like to focus on building an assessor for a specific LM, techniques like instance weights or oversampling could have an effect.

Comparing these results with GPT-3's self-assessment in Table 1, we can conclude that the assessor performs slightly worse than GPT-3, but is definitely comparable (Q2). A significant part of the difference in BS comes from the calibration (CAL) term, and not from the refinement (REF) term, which is very similar for the LMs and the assessor, especially for the smaller versions of GPT-3. This suggests that post-hoc calibration methods [40] like isotonic regression could still improve results significantly.

In the feature importance study in Table 3, we can see that using either the system id or the number of parameters improves performance significantly, likely because both can indicate the scale of the system, which highly correlates with performance. The use of system id generalises slightly worse than #parameters. Other features, like #shots, prompt-template id, or model family and sparsity indicators, have less effect on the performance (Q3). The assessor can easily derive the #shots from the input (more examples result in more features being present), so this makes sense. We did not measure any effect on aggregated performance from the different prompt-templates, and it is likely this feature is simply non-informative. Regarding model family and sparsity, we hypothesise that there is a large overlap between the instances that the different LMs solve correctly, so model architecture is not indicative of major performance differences (or the assessor fails to pick up on them).

Finally, we discuss a concrete example using the assessor to implement a reject rule (see Table 4). For the GPT-3 data in the test set (24604 instances), we take a reject threshold of 1%, i.e., we reject instances where the assessor deems it is less than 1% likely the LM would succeed. The assessor rejects 5340 instances, which account for 21.7% of the instances and (approximately) of the total compute. Of these 5340, 5114 are correctly rejected, representing 46% of the failures, at the cost of only 226 correct answers being rejected (about 1.5%).

Table 4: Confusion matrix with reject threshold < 0.01 of assessor predictions for GPT-3. "failure" and "correct" denote wrong and correct responses by the LM, respectively.

                       Actual failure   Actual correct
Predicted failure      5114             226
Predicted correct      6004             13481

Therefore, a lot of compute, money, and emissions would be saved, since the assessor is far smaller than the LMs in terms of parameters and inference time. Concretely, the proposed assessor has 100 decision trees of (approximately) 20000 nodes each, whose inference time is in the order of $100 \cdot \log_2(20000) \approx 1450$ comparisons, far fewer than the operations an LM requires for one pass through its billions of parameters.

5. Conclusions and Future work

We have illustrated how a small assessor can manage performance expectations at a level that is comparable to the self-assessment of giant language models with billions of parameters. We have shown the assessor can be well calibrated and make refined predictions. We find that the assessor picks up on system features like id or #parameters that explain large variances in performance. We showcase how assessors can be used to reject instances before running much larger language models, resulting in a significant saving of compute.

There are of course some limitations to this work. For example, the instance metafeatures are specific to the data-wrangling tasks used. Nonetheless, the positive results hint at future work. There are still many ways of directly improving the assessor we have used here. For instance, we could use post-hoc calibration with methods such as isotonic regression, or add instance weights to the results of systems we especially care about. There are also many questions to further investigate. Do assessors work for other tasks? Can we use a small LM instead of a random forest to allow free-form input? What is the agreement between different systems, and with the assessor?

These future ideas could be useful from the perspective of saving computing costs as we outlined before, but the schema is of wider applicability. There is a lot of useful information generated during the evaluation process that is lost upon aggregation. Assessors are an attempt at capturing this information and providing expectation management that is external, fine-grained, anticipative, and can make use of population data. We could use them as instance-level model selectors, or we might be able to apply explainability techniques on the assessor to find out what makes an instance difficult.

There is definitely more to explore around the topic of assessors, which perform granular assessments beyond generic aggregated results: saving compute by rejecting examples where the original model is going to fail is an important illustrative application.

Acknowledgments

We thank the anonymous reviewers for their comments. This work has been partially supported by the Norwegian Research Council grant 329745 Machine Teaching for Explainable AI, also by the EU (FEDER) and Spanish MINECO grant RTI2018-094403-B-C32 funded by MCIN/AEI/10.13039/501100011033 and by "ERDF A way of making Europe", Generalitat Valenciana under grant PROMETEO/2019/098, EU's Horizon 2020 research and innovation programme under grant agreement No. 952215 (TAILOR), US DARPA HR00112120007 (RECoG-AI), and INNEST/2021/317 (Project cofunded by the European Union with the "Programa Operativo del Fondo Europeo de Desarrollo Regional (FEDER) de la Comunitat Valenciana 2014-2020") and the UPV (Vicerrectorado de Investigación) grant PAI-10-21.

References

[1] J. Devlin, M.-W. Chang, K. Lee, K. Toutanova, BERT: Pre-training of deep bidirectional transformers for language understanding, arXiv preprint arXiv:1810.04805 (2018).
[2] C. Raffel, N. Shazeer, A. Roberts, K. Lee, S. Narang, M. Matena, Y. Zhou, W. Li, P. J. Liu, Exploring the limits of transfer learning with a unified text-to-text transformer, arXiv preprint arXiv:1910.10683 (2019).
[3] T. B. Brown, B. Mann, N. Ryder, M. Subbiah, J. Kaplan, P. Dhariwal, A. Neelakantan, P. Shyam, G. Sastry, A. Askell, et al., Language models are few-shot learners, in: Advances in Neural Information Processing Systems, volume 33, 2020, pp. 1877–1901.
[4] E. Kharitonov, A. Lee, A. Polyak, Y. Adi, J. Copet, K. Lakhotia, T.-A. Nguyen, M. Rivière, A. Mohamed, E. Dupoux, W.-N. Hsu, Text-Free Prosody-Aware Generative Spoken Language Modeling, arXiv:2109.03264 [cs, eess] (2021). arXiv:2109.03264.
[5] J. W. Rae, S. Borgeaud, T. Cai, K. Millican, J. Hoffmann, F. Song, J. Aslanides, S. Henderson, R. Ring, S. Young, et al., Scaling language models: Methods, analysis & insights from training Gopher, arXiv preprint arXiv:2112.11446 (2021).
[6] D. Hendrycks, C. Burns, S. Basart, A. Zou, M. Mazeika, D. Song, J. Steinhardt, Measuring massive multitask language understanding, arXiv preprint arXiv:2009.03300 (2020).
[7] D. Hendrycks, S. Basart, S. Kadavath, M. Mazeika, A. Arora, E. Guo, C. Burns, S. Puranik, H. He, D. Song, J. Steinhardt, Measuring coding challenge competence with APPS, 2021. arXiv:2105.09938.
[8] R. Bommasani, et al., On the opportunities and risks of foundation models, arXiv preprint arXiv:2108.07258, 2021.
[9] L. Ouyang, J. Wu, X. Jiang, D. Almeida, C. L. Wainwright, P. Mishkin, C. Zhang, S. Agarwal, K. Slama, A. Ray, et al., Training language models to follow instructions with human feedback, arXiv preprint arXiv:2203.02155 (2022).
[10] A. Srivastava, A. Rastogi, et al., Beyond the imitation game: Quantifying and extrapolating the capabilities of language models, 2022. URL: https://arxiv.org/abs/2206.04615. doi:10.48550/ARXIV.2206.04615.
[11] R. Herbei, M. H. Wegkamp, Classification with reject option, The Canadian Journal of Statistics/La Revue Canadienne de Statistique (2006) 709–721.
[12] F. Tortorella, An optimal reject rule for binary classifiers, in: Joint IAPR International Workshops on Statistical Techniques in Pattern Recognition (SPR) and Structural and Syntactic Pattern Recognition (SSPR), Springer, 2000, pp. 611–620.
[13] K. Hendrickx, L. Perini, D. Van der Plas, W. Meert, J. Davis, Machine learning with a reject option: A survey, arXiv preprint arXiv:2107.11277 (2021).
[14] Z. Jiang, J. Araki, H. Ding, G. Neubig, How Can We Know When Language Models Know? On the Calibration of Language Models for Question Answering, Transactions of the Association for Computational Linguistics 9 (2021) 962–977. doi:10.1162/tacl_a_00407.
[15] J. Hernández-Orallo, W. Schellaert, F. Martínez-Plumed, Training on the test set: Mapping the system-problem space in AI, Proceedings of the AAAI Conference on Artificial Intelligence (2022).
[16] S. Kandel, J. Heer, C. Plaisant, J. Kennedy, F. Van Ham, N. H. Riche, C. Weaver, B. Lee, D. Brodbeck, P. Buono, Research directions in data wrangling: Visualizations and transformations for usable and credible data, Information Visualization 10 (2011) 271–288.
[17] T. Furche, G. Gottlob, L. Libkin, G. Orsi, N. Paton, Data wrangling for big data: Challenges and opportunities, in: Advances in Database Technology - EDBT 2016: Proceedings of the 19th International Conference on Extending Database Technology, 2016, pp. 473–478.
[18] G. Jaimovitch López, Comparison between machine learning and human learning from examples generated with machine teaching, 2020.
[19] W. Zeng, X. Ren, T. Su, H. Wang, Y. Liao, Z. Wang, X. Jiang, Z. Yang, K. Wang, X. Zhang, et al., PanGu-α: Large-scale autoregressive pretrained Chinese language models with auto-parallel computation, arXiv preprint arXiv:2104.12369 (2021).
[20] N. Du, Y. Huang, A. M. Dai, S. Tong, D. Lepikhin, Y. Xu, M. Krikun, Y. Zhou, A. W. Yu, O. Firat, B. Zoph, L. Fedus, M. Bosma, Z. Zhou, T. Wang, Y. E. Wang, K. Webster, M. Pellat, K. Robinson, K. Meier-Hellstern, T. Duke, L. Dixon, K. Zhang, Q. V. Le, Y. Wu, Z. Chen, C. Cui, GLaM: Efficient Scaling of Language Models with Mixture-of-Experts, arXiv:2112.06905 [cs] (2021). arXiv:2112.06905.
[21] S. Zhang, S. Roller, N. Goyal, M. Artetxe, M. Chen, S. Chen, C. Dewan, M. Diab, X. Li, X. V. Lin, T. Mihaylov, M. Ott, S. Shleifer, K. Shuster, D. …
[27] … on Databases, Springer, 2017, pp. 36–48.
[28] S. Gulwani, J. Hernández-Orallo, E. Kitzelmann, S. H. Muggleton, U. Schmid, B. Zorn, Inductive programming meets the real world, Communications of the ACM 58 (2015) 90–99.
[29] L. Contreras-Ochando, C. Ferri, J. Hernández-Orallo, F. Martínez-Plumed, M. J. Ramírez-Quintana, S. Katayama, Automated data transformation with inductive programming and dynamic background knowledge, in: Joint European Conference on Machine Learning and Knowledge Discovery in Databases, Springer, 2019, pp. 735–751.
[30] S. Kandel, A. Paepcke, J. Hellerstein, J. Heer, Wrangler: Interactive visual specification of data transformation scripts, in: Proceedings of the SIGCHI Conference on Human Factors in Computing Systems, 2011, pp. 3363–3372.
[31] T. De Bie, L. De Raedt, J. Hernández-Orallo, H. H. Hoos, P. Smyth, C. K. I. Williams, Automating data science, Communications of the ACM 65 (2022) 76–87. doi:10.1145/3495256.
[32] R. Puri, B. Catanzaro, Zero-shot text classification with generative language models, arXiv preprint arXiv:1912.10165 (2019).
[33] T. Schick, H. Schütze, Exploiting cloze questions for few shot text classification and natural language inference, arXiv preprint arXiv:2001.07676 (2020).
[34] G. Jaimovitch-Lopez, C. Ferri, J. Hernandez-Orallo, …
Simig, P. S. F. Martınez-Plumed, M. J. Ramırez-Quintana, Can Koura, A. Sridhar, T. Wang, L. Zettlemoyer, OPT: language models automate data wrangling?, in: Open Pre-trained Transformer Language Models, Workshop on Automating Datascience at ECML- arXiv:2205.01068 [cs] (2022). arXiv:2205.01068 . PKDD, 2021, p. 13. [22] J. Kaplan, S. McCandlish, T. Henighan, T. B. Brown, [35] B. Zoph, I. Bello, S. Kumar, N. Du, Y. Huang, J. Dean, B. Chess, R. Child, S. Gray, A. Radford, J. Wu, N. Shazeer, W. Fedus, Designing effective sparse D. Amodei, Scaling laws for neural language expert models, arXiv preprint arXiv:2202.08906 models, CoRR abs/2001.08361 (2020). URL: https: (2022). //arxiv.org/abs/2001.08361. arXiv:2001.08361 . [36] L. Breiman, Random forests, Machine learning 45 [23] R. Desislavov, F. Martínez-Plumed, J. Hernández- (2001) 5–32. Orallo, Compute and energy consumption trends [37] A. Murphy, A New Vector Partition of the Prob- in deep learning inference, arXiv preprint ability Score., Journal of Applied Meteorology 12 arXiv:2109.05472 (2021). (1973) 595–600. [24] O. Sharir, B. Peleg, Y. Shoham, The cost of training [38] P. Flach, E. Matsubara, On classification, ranking, NLP models: A concise overview, arXiv preprint and probability estimation, in: Dagstuhl Seminar arXiv:2004.08900 (2020). Proceedings, Schloss Dagstuhl-Leibniz-Zentrum fr [25] B. Dickson, The GPT-3 economy, Informatik, 2008. https://bdtechtalks.com/2020/09/21/ [39] J. Hernández-Orallo, P. Flach, C. Ferri Ramírez, A gpt-3-economy-business-model/, 2020. unified view of performance metrics: Translating [26] D. Steinberg, How much time needs threshold choice into expected classification loss, to be spent preparing data for analysis?, Journal of Machine Learning Research 13 (2012) http://info.salford-systems.com/blog/bid/299181/ 2813–2869. How-Much-Time-Needs-to-be-Spent-Preparing-Data/[40] A. Bella, C. Ferri, J. Hernández-Orallo, M. J. Ramírez- /-for-Analysis (2013). 
Quintana, Calibration of machine learning models, [27] A. Bogatu, N. W. Paton, A. A. Fernandes, Towards in: Handbook of Research on Machine Learning Ap- automatic data format transformations: Data wran- plications and Trends: Algorithms, Methods, and gling at scale, in: British International Conference Techniques, IGI Global, 2010, pp. 128–146.