<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta>
      <journal-title-group>
        <journal-title>Workshop on AI Evaluation Beyond Metrics (July 2022)</journal-title>
      </journal-title-group>
    </journal-meta>
    <article-meta>
      <article-id pub-id-type="doi">10.1145/3495256</article-id>
      <title-group>
        <article-title>Run: Small Assessors Anticipate Big Language Models</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Wout Schellaert</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
          <xref ref-type="aff" rid="aff2">2</xref>
          <xref ref-type="aff" rid="aff3">3</xref>
          <xref ref-type="aff" rid="aff4">4</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Lexin Zhou</string-name>
          <contrib-id contrib-id-type="orcid">0000-0003-1161-4270</contrib-id>
          <email>lzhou@inf.upv.es</email>
          <xref ref-type="aff" rid="aff0">0</xref>
          <xref ref-type="aff" rid="aff2">2</xref>
          <xref ref-type="aff" rid="aff3">3</xref>
          <xref ref-type="aff" rid="aff4">4</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Fernando Martínez-Plumed</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
          <xref ref-type="aff" rid="aff2">2</xref>
          <xref ref-type="aff" rid="aff3">3</xref>
          <xref ref-type="aff" rid="aff4">4</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>José Hernández-Orallo</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
          <xref ref-type="aff" rid="aff2">2</xref>
          <xref ref-type="aff" rid="aff3">3</xref>
          <xref ref-type="aff" rid="aff4">4</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Cèsar Ferri</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
          <xref ref-type="aff" rid="aff2">2</xref>
          <xref ref-type="aff" rid="aff3">3</xref>
          <xref ref-type="aff" rid="aff4">4</xref>
        </contrib>
        <kwd-group>
          <kwd>Assessor</kwd>
          <kwd>Anticipative Reject Option</kwd>
          <kwd>Language Model</kwd>
          <kwd>Data Wrangling</kwd>
          <kwd>AI Evaluation</kwd>
          <kwd>Instance Granularity</kwd>
        </kwd-group>
        <aff id="aff0">
          <label>0</label>
        </aff>
        <aff id="aff1">
          <label>1</label>
          <institution>Leverhulme Centre for the Future of Intelligence, University of Cambridge</institution>
        </aff>
        <aff id="aff2">
          <label>2</label>
          <institution>Valencian Research Institute for Artificial Intelligence (VRAIN), Universitat Politècnica de València</institution>
        </aff>
        <aff id="aff3">
          <label>3</label>
          <institution>Workshop Proceedings</institution>
        </aff>
        <aff id="aff4">
          <label>4</label>
        </aff>
      </contrib-group>
      <pub-date>
        <year>2022</year>
      </pub-date>
      <volume>25</volume>
      <issue>2022</issue>
      <fpage>36</fpage>
      <lpage>48</lpage>
      <abstract>
        <p>Large Language Models (LMs) are expensive to operate. It would be more frugal to avoid querying them when results are predictably bad. In this paper we therefore investigate whether it is possible to granularly predict the performance of these large LMs with a much smaller external model, the assessor, which is trained on evaluation results. For instance, given an input prompt, can an assessor estimate the probability of correct completion by a giant like GPT-3 Davinci (175B parameters)? Using a data-wrangling task included in the BIG-bench repository as a case study, we find it is indeed possible, and we report results that are comparable in accuracy and calibration to the LM itself. This suggests that, at least for some tasks, a lot of compute, money, and emissions could be spared through the assessor's anticipative reject option. It also suggests that assessors can capture meaningful extra information from the evaluation procedure, and as such, could be a useful complement to simple aggregate metrics.</p>
      </abstract>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>1. Introduction</title>
      <p>Extensive experimental research on Language Models (LMs) keeps showing remarkable results across several domains including mathematics, question answering, language understanding, and code generation [1, 2, 3, 4, 5, 6, 7, 8, 9]. While the performance results for many tasks are quickly improving –on average– there is a high variance in the results depending on the particular task, the instances, and the prompts [10]. For a given task, one can partially deal with the variability across instances by only using the model’s decision when the probability of its answer (i.e., its “confidence”) does not fall below a certain threshold [11, 12, 13]. This requires a good calibration of the model. However, even if LMs were well calibrated, and they are generally not [10, 14], it would also still require actually running the inference. For large LMs this comes at a non-negligible cost per token. For this, we need an assessor: an external conditional probability estimator that predicts the performance of the LM on an instance (or a collection of instances), without running it through the LM at all. This enables the calculation of whether it is actually worth asking the LM for an answer, depending on factors such as the value of a correct result, the cost of running the model, and of course the performance estimated by the assessor. See Figure 1 for an illustrative example.</p>
      <p>Figure 1: (Top) Process of a LM generating the solution for a date transformation (repetitive) problem in a spreadsheet. Once the user prompts one instance of the desired transformation (row 1), the LM proceeds to transform the rest of the instances (rows 2 &amp; onward). (Bottom) Process of an assessor that can reliably predict beforehand the performance of the LM at the instance level.</p>
      <p>This paper describes a (successful) attempt at building such an assessor for a collection of large LMs consisting of various scales of GPT-3 [3] and BIG-G [10] models, for application to a diverse set of data-wrangling tasks. Data wrangling [16, 17] is a notoriously time-consuming data preparation chore where LMs have recently shown promising results [18]. We discuss data wrangling and the considerations regarding the use of LMs in sections 2.1 and 2.2.</p>
      <sec id="sec-1-1">
        <title>Contributions</title>
        <p>To our knowledge, this is the first paper analysing assessors applied to the language domain, and to a semi-plausible use case in general. Additionally:</p>
        <list list-type="bullet">
          <list-item><p>We find that lightweight assessors can give reliable instance-level predictions of the performance of large LMs.</p></list-item>
          <list-item><p>We find that their predictions are well-calibrated and unbiased, again comparable to the self-assessment of the LMs.</p></list-item>
          <list-item><p>We investigate the contributions of various features like #shots and #parameters to assessor performance.</p></list-item>
        </list>
      </sec>
    </sec>
    <sec id="sec-1-2">
      <title>2. Background</title>
      <p>In this section we revisit some key ideas of LMs, their costs, their applications to the data-wrangling problem, and the traditional (post-hoc) reject option. We also summarise the main elements of the recently introduced concept of assessor models.</p>
      <sec id="sec-1-2-1">
        <title>2.1. (Large) Language Models</title>
        <p>In less than a decade, research in Natural Language Processing (NLP) has been overturned by the appearance of a suite of LMs trained in an unsupervised manner on very large corpora. LMs are capturing more and more of the information in natural language, including the linguistic characteristics of various human languages and associated knowledge. Moreover, these models can be adapted (e.g., through fine-tuning) to a wide range of downstream tasks [8]. Recent LMs such as GPT-3 [3], PanGu [19], GLaM [20] and OPT [21] have excelled at few-shot inference, where a task is solved by supplying a small set of correct examples formatted as a prompt. The quality of the completion usually depends on the number of supplied examples. For instance, 5-shot inference is usually better than 2-shot inference, but requires more effort from the user.</p>
        <p>However, on many occasions the cost of running LMs is not negligible in both computational [22, 23] and economic terms [24]. Large LMs, open source or not, all have steep development costs in common. A recent study [24] puts the cost of developing a LM with only 1.5 billion parameters at $1.6 million. Inference cost is another drain: [25] estimates the cost of running GPT-3, if run in the cloud, at a minimum of $87,000 per year, with the current API price for Davinci being 6 cents per 750 words (https://openai.com/api/pricing/). Of course, these costs go down quickly as compute becomes cheaper, but larger models are expected to replace the old ones quickly to set the new state of the art. Also, as LMs increase their performance, their penetration rates will increase, becoming widespread in billions of automated operations in many domains, and compute might easily become more of an issue, not less.</p>
      </sec>
      <sec id="sec-1-2-2">
        <title>2.2. Data Wrangling</title>
        <p>Data wrangling [16, 17] is a data preparation task that data janitors, data scientists and other people operating with forms, spreadsheets and other data formatting situations consider a very monotonous and laborious part of their jobs. Data wrangling can require as much as 80 percent of their time [26], including tediously transforming data presented in heterogeneous formats into a standardised format for efficient access, understanding, and analysis. One of the challenges in data-wrangling automation consists of selecting the correct (string) transformations from the vast set of possible ones, and doing so by only having seen a few examples [27]. Many approaches have attempted to address this challenge by reducing the transformation space through the incorporation of prior knowledge [28, 29]. This led to many tools that use domain-specific languages or need ad hoc solutions [30].</p>
        <p>Because LMs capture vast amounts of human knowledge across many different domains, they can be specially effective for more open-ended tasks, and as such data wrangling is recognised in data science automation [31]. Using few-shot inference [3, 32, 33], LMs have shown promising yet unreliable results for data wrangling. For instance, in [18] GPT-3 Davinci (prompted, not finetuned) achieves a 56% accuracy in the 1-shot setting, 68% with the 4-shot setting, and almost 90% with 10 shots. Additionally, as opposed to LM results on other tasks, GPT-3 is also relatively well calibrated in the data-wrangling task, reporting a Brier score of 0.11 (see section 4).</p>
      </sec>
    </sec>
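      <p>As a concrete illustration of the few-shot inference described above, the following sketch builds an n-shot prompt for a date-normalisation task in the style of the Figure 1 example. The "Input:"/"Output:" record format and the build_prompt helper are illustrative assumptions, not the exact prompt template used in the paper.</p>

```python
# Illustrative sketch (assumed format, not the paper's exact template):
# build an n-shot prompt for a date-normalisation data-wrangling task.
# Each "shot" is one solved input -> output pair; the final line leaves
# the output blank for the LM to complete.

def build_prompt(examples, query):
    """examples: list of (raw, normalised) pairs; query: raw string to transform."""
    lines = []
    for raw, normalised in examples:
        lines.append(f"Input: {raw}")
        lines.append(f"Output: {normalised}")
    lines.append(f"Input: {query}")
    lines.append("Output:")  # completion point for the LM
    return "\n".join(lines)

# A 2-shot prompt, using date strings like those in Figure 1
shots = [
    ("Tue Mar 14 19:09:37 CDT 2021", "14/03/21 07:09:37 PM"),
    ("Jul 2nd, 2019 13:37:37 (EDT)", "02/07/19 01:37:37 PM"),
]
prompt = build_prompt(shots, "June 1, 2020 CMT 18:07:26")
```

      <p>More shots generally improve accuracy, at the price of a longer prompt and more user effort, which is exactly the cost trade-off the assessor is meant to arbitrate.</p>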
    <sec id="sec-2a">
      <title>2.3. Reject option</title>
      <p>Given these unreliable accuracies but good calibration scores, we could have a more reliable and effective use of these systems by not using those outputs for which the confidence of the system is low. In other words, if we know for which instances the LM is (likely to be) wrong, we can abstain from using the output of the LM in these cases. This is called a ‘reject option’, and a classic and straightforward implementation for it is to use a confidence threshold t and compare it with the probability p̂(y|x) that a model assigns to its output ŷ. This represents the self-assigned probability of being correct (i.e., its confidence) [11, 12, 13]. We set t to match the error tolerance of the use case, and when p̂(y|x) falls below t, we do not trust the output of the model, usually delegating to a human. However, this classical interpretation of the reject rule still requires running the model. As mentioned before, this can be expensive for large LMs. Whenever the reject rule triggers, it is not only that humans need to do the task manually, but we have also incurred a cost in the computation of a model that is effectively wasted.</p>
    </sec>
    <sec id="sec-2b">
      <title>2.4. Assessors</title>
      <p>Assessor models [15] provide an external, anticipative reject option instead. Assessors are conditional probability (or density) estimators r̂(s|π, i) that are trained on evaluation data. With ‘evaluation data’ we mean a set of evaluation records ⟨π, i, s⟩, where π refers to a profile or description of a particular system (e.g., deployment conditions, state, system architecture, or hyperparameters), i refers to a particular instance (e.g., a prompt), and s to an empirical measurement of the performance of π on i. Assessor models are meant to act as general mappings between the space of systems, the space of instances, and the corresponding distribution of scores. They are a way of capturing all available evaluation information in a single predictive model that could be used, e.g., to investigate what features make an instance difficult, to add confidence capabilities to systems that do not have them, or to select the optimal model for a specific instance. In this case, we focus on their use to provide an anticipative reject option: when r̂ is built and shown to be an accurate estimator, we can use it to make inferences about the expected performance r̂(s = 1|π, i) given a system and instance (or a collection of those). We do still have to run actual inference on the assessor, but as we show in the experiments, assessors can be multiple orders of magnitude smaller than the LMs, allowing us to cheaply avoid any LM inference that is doomed to fail.</p>
    </sec>
    <sec id="sec-2">
      <title>3. Methods</title>
      <p>In this section we identify the experimental setting, including the goals of the analysis, the data sources and how they are converted into evaluation records, and how we build the assessor from them.</p>
      <sec id="sec-2-0-1">
        <title>3.1. Experimental Questions</title>
        <p>We set three experimental questions:</p>
        <list list-type="simple">
          <list-item><p>Q1: Can we build lightweight yet good assessors for language models in this domain?</p></list-item>
          <list-item><p>Q2: Are the assessors of comparable quality to the language models when estimating probabilities?</p></list-item>
          <list-item><p>Q3: What features from the systems and the instances are most relevant for predicting success and consequently for building good assessors?</p></list-item>
        </list>
      </sec>
      <sec id="sec-2-0-2">
        <title>3.2. Data Sources and Train-Test Split</title>
        <p>We work with the Data Wrangling Dataset Repository, containing 119 tasks from 7 domains (dates, emails, free text, names, phones, times, and units). In particular, we use results (at the instance level) from multiple LMs obtained from two different evaluation efforts. First, [34] have produced granular results of the evaluation of different versions of GPT-3. We have 146k instances available for the GPT-3 models Ada (350M), Babbage (1.3B), Curie (6.7B), and Davinci (175B), from 0-shot to 10-shot. More information about the architectures can be found in [3]. Second, [10] provides results on the same benchmark for a collection of Google LMs of various parameter sizes. Here we extract 86k instances, from 0-shot to 3-shot, for 22 models with parameter sizes ranging from 2M to 128B across two different model families, a decoder-only dense transformer (BIG-G dense) and a sparse Mixture-of-Experts [35] model (BIG-G sparse). More information on the BIG-G network architectures is available in [10]. All models (GPT-3 and BIG-G variants) were queried with temperature set to 0, and none of them were fine-tuned for the data-wrangling task.</p>
        <p>As the assessor is trained on a somewhat heterogeneous collection of systems and instances, we have to be careful to define a train-test partition of the evaluation results without contamination or information leakage. To this purpose, we must ensure that the same instances are consistently used across systems and shots. For example, we have to avoid that the result of BIG-G dense with 2 shots on instance i is in the training set, while GPT-3 Ada’s result with 0 shots on the same instance is in the test set. Figure 2 shows a visual representation of the partition requirements that ensure that this does not happen. The order-matched train-test partition leads to 194k training instances and 38k testing instances.</p>
      </sec>
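      <p>The partition requirement above can be sketched as a grouped split: all records sharing an instance id go to the same side, regardless of system or #shots. The record fields and the grouped_split helper below are hypothetical names, a minimal stand-in for the order-matched procedure rather than its actual implementation.</p>

```python
# Minimal sketch of a leakage-free partition: every record for the same
# instance id (across systems and #shots) lands on the same side of the
# train/test split. Record fields here are hypothetical.
import random

def grouped_split(records, test_fraction=0.2, seed=0):
    """records: list of dicts with keys 'system', 'shots', 'instance', 'score'."""
    instance_ids = sorted({r["instance"] for r in records})
    rng = random.Random(seed)
    rng.shuffle(instance_ids)
    n_test = int(len(instance_ids) * test_fraction)
    test_ids = set(instance_ids[:n_test])
    train = [r for r in records if r["instance"] not in test_ids]
    test = [r for r in records if r["instance"] in test_ids]
    return train, test

# Toy evaluation records: 2 systems x 2 shot settings x 10 instances
records = [
    {"system": s, "shots": k, "instance": i, "score": 1}
    for s in ("gpt3-ada", "bigg-dense-2m") for k in (0, 2) for i in range(10)
]
train, test = grouped_split(records)
# No instance id appears on both sides of the split:
assert {r["instance"] for r in train}.isdisjoint({r["instance"] for r in test})
```
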
      <sec id="sec-2-1">
        <title>3.3. Anatomy of the Evaluation Record</title>
        <p>For the data-wrangling tasks, all scores are binary: 1 if the output of the LM matches the target string exactly, and 0 otherwise. The score is what the assessors must predict, and thus acts as a label during training.</p>
        <p>From our two data sources, we receive records of the shape ⟨system id, #shots, prompt, score⟩. We further annotate this record with features describing the system (π), and extract metafeatures of the instance (i) that are fit for tabular representation (as opposed to free-form text). In the end, this creates a general record of the shape ⟨π, i, s⟩ = ⟨⟨system features⟩, ⟨instance features⟩, score⟩. We describe these features in detail below, but ultimately the only constraint for making a useful assessor is that all system and instance features are available without actually running the original model.</p>
        <sec id="sec-2-1-1">
          <title>3.3.1. System features</title>
          <p>The available system features include a system id that refers to a specific trained LM, i.e., a set of learned parameters fitting a certain architecture, the id of that architecture (either GPT-3, BIG-G sparse, or BIG-G dense), whether a model is dense or sparse, and the number of parameters. These features will of course be the same for all records of the same trained model.</p>
        </sec>
        <sec id="sec-2-1-2">
          <title>3.3.2. Instance features</title>
          <p>Instance features include the number of shots, the id of the prompt-template (the prompt-template differs between [10] and [34], but the same metafeatures can be extracted), and 54 simple binary metafeatures that can be automatically extracted through simple regular expressions from the original text. Examples include the kind of symbols the instance contains (e.g., numbers, dots, dashes) or whether it starts with a digit (see Figure 3 for an example). We refer to [29] for an overview. The binary metafeatures are available for all input and output text that is in the prompt, so for example for a 2-shot prompt, we would have 2 inputs and 2 outputs from the examples, and 1 input for the actual question, totalling 5 ⋅ 54 = 270 features.</p>
        </sec>
      </sec>
      <sec id="sec-2-2">
        <title>3.4. Assessor Building and Evaluation</title>
        <p>For the assessor model, we train a Random Forest [36] of 100 decision trees, with a minimum node size of 5 and 50% of the available variables randomly selected in each split, tuned through grid search on a validation set using an 84%-16% training-validation split of the training set defined previously (the non-standard train-test partition is the result of the instance matching procedure described in section 3.2). For the remaining hyperparameters the defaults were used (the randomForest package, https://cran.r-project.org/web/packages/randomForest/index.html, was used for training the assessor model). We report the Area Under the Receiver Operating Characteristic Curve (AUROC) and the Brier Score (BS), as well as its decomposition into calibration and refinement loss [37, 38, 39].</p>
        <p>As a baseline to compare assessors to, we take the standard approach of interpreting the probability p̂(y|x) the LM assigns to its output ŷ as the “confidence” of the model, i.e., its self-assessed probability of being correct. However, there is no p̂(y|x) data recorded in the BIG-bench logs, so we cannot compare the assessor AUROC or BS to those of the BIG-G family of models. For GPT-3 this information is available.</p>
      </sec>
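      <p>The Brier score and its decomposition into calibration and refinement loss can be computed as in the sketch below. This is a minimal variant that groups records by unique predicted probability (one simple form of the decompositions in the cited literature), with toy numbers rather than the paper's data.</p>

```python
# Hedged sketch of the reported metrics: Brier score of an assessor and its
# exact decomposition into calibration and refinement loss, grouping records
# by unique predicted probability.
from collections import defaultdict

def brier_decomposition(probs, outcomes):
    """probs: predicted success probabilities; outcomes: 0/1 actual scores."""
    n = len(probs)
    # Brier score: mean squared difference between probability and outcome
    brier = sum((p - o) ** 2 for p, o in zip(probs, outcomes)) / n
    groups = defaultdict(list)
    for p, o in zip(probs, outcomes):
        groups[p].append(o)
    # Calibration: distance between each predicted probability and the
    # observed success rate of its group
    calibration = sum(
        len(os) * (p - sum(os) / len(os)) ** 2 for p, os in groups.items()
    ) / n
    # Refinement: within-group outcome variance (0 when groups are pure)
    refinement = sum(
        len(os) * (sum(os) / len(os)) * (1 - sum(os) / len(os))
        for os in groups.values()
    ) / n
    return brier, calibration, refinement

probs = [0.9, 0.9, 0.9, 0.9, 0.2, 0.2, 0.2, 0.2, 0.2]
outcomes = [1, 1, 1, 0, 0, 0, 0, 0, 1]
bs, cal, ref = brier_decomposition(probs, outcomes)
# With this grouping the decomposition is exact: BS = CAL + REF
assert round(bs - (cal + ref), 9) == 0
```

      <p>A perfectly calibrated but unrefined predictor has all its loss in the refinement term; the paper's observation that most of the assessor's gap to GPT-3 sits in the calibration term is what motivates post-hoc recalibration.</p>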
    </sec>
    <sec id="sec-3">
      <title>4. Results and Discussion</title>
      <p>Since it is assumed that all models give better results with n + 1 shots than with n shots, Figure 4 shows the accuracies of the LMs on the original data-wrangling tasks with the maximum number of shots used in the BIG-G data (3 shots), the same number of shots for GPT-3 (for comparability), and the maximum used for GPT-3 (10 shots). Despite the promising progress in the state-of-the-art capabilities of LMs, they still struggle to master data-wrangling tasks with very few shots. For 3-shot inference, BIG-G dense 128B achieves an accuracy of 0.776, outperforming BIG-G sparse 8B and GPT-3 175B. When it comes to 10 shots (only GPT-3 was available), GPT-3 175B achieves a promising accuracy of nearly 90%, outperforming the other GPT-3 variants.</p>
      <p>Table 1 describes the AUROC and BS, decomposed into calibration loss (CAL) and refinement loss (REF), for the GPT-3 data given by an assessor trained with all features available except for the prompt-template id, along with GPT-3’s self-assessment on the same set of instances. Table 2 describes the AUROC and BS, with the same decomposition, for the BIG-G data given by the assessor. Finally, Table 3 shows the impact of various system and instance features on the performance of the assessor.</p>
      <p>Table 2: AUROC and BS (Calibration, Refinement) for BIG-G data, using a single assessor trained with both GPT-3 and BIG-G data and all available features (except prompt-template id). In the bottom row, AUROC and BS are not averaged, but calculated from the aggregated set of instances. The average accuracies of the LMs (with std. dev. across #shots) on the original data-wrangling task are also presented, and serve as an indication of the class distribution the assessor has to deal with.</p>
      <p>Table 3: Ablation study of the impact of various features (system id, #parameters, template id, fam. &amp; sp., #shots; the 54 instance metafeatures are always included) on assessor performance, reporting AUROC and BS (CAL, REF) per feature combination. Row 4 indicates the assessor we reported in Table 1 and Table 2. Row 1: 0.909, 0.130 (0.015, 0.115); Row 2: 0.912, 0.130 (0.015, 0.115); Row 3: 0.910, 0.127 (0.014, 0.113); Row 4: 0.916, 0.128 (0.017, 0.111); Row 5: 0.916, 0.126 (0.015, 0.111); Row 6: 0.917, 0.126 (0.015, 0.111); Row 7: 0.916, 0.128 (0.016, 0.112); Row 8: 0.868, 0.168 (0.031, 0.137); Row 9: 0.869, 0.167 (0.030, 0.137); Row 10: 0.865, 0.170 (0.033, 0.137).</p>
      <p>Analysing the results in Table 1 and Table 2, we see relatively good results overall in the assessor’s performance, reporting AUROCs of around 0.9, and BSs around 0.12 (Q1). It should be noted that the metrics for the smallest LMs have to be interpreted cautiously due to the significant imbalance in the distribution of LM scores (i.e., for very low accuracies it is easy to predict that the LM will usually fail). We also see that, in general, performance is worse for the BIG-G models than for the GPT-3 ones. This could be due to GPT-3 being more predictable, or to the availability of more data for GPT-3 (possibly making the assessor pay more attention to the majority model family in its generalisation). This observation suggests that the distribution of results of each system affects the performance of the assessor accordingly. If we would like to focus on building an assessor for a specific LM, techniques like instance weights or oversampling could have an effect.</p>
      <p>Comparing the results with GPT-3’s self-assessment in Table 1, we can conclude that the assessor performs slightly worse than GPT-3, but is definitely comparable (Q2). A significant part of the difference in BS comes from the calibration (CAL) term, and not from the refinement (REF) term, which is very similar for the LMs and the assessor, especially for the smaller versions of GPT-3. This suggests that post-hoc calibration methods [40] like isotonic regression could still improve results significantly.</p>
      <p>Finally, we discuss a concrete example using the assessor to implement a reject rule (see Table 4). For the GPT-3 data in the test set (24604 instances), we take a reject threshold of 1%, i.e., the assessor rejects an instance when the LM is assessed to have less than a 1% chance of succeeding on it. The assessor rejects 5340 instances, which account for 21.7% of the instances and (approximately) of the total compute. From these 5340, we have that 5114 are correctly rejected, representing 46% of the failures, at the cost of only 226 correct answers being rejected (about 1.5%). Therefore, a lot of compute, money, and emissions would be saved, since the assessor is far smaller than the LMs in terms of parameters and inference time. Concretely, the proposed assessor has 100 decision trees of (approximately) 20000 nodes, whose inference time is in the order of 100 ⋅ log2(20000) ≈ 1450 comparisons, much smaller than what LMs require for one pass through their billions of parameters.</p>
      <p>Table 4: Confusion matrix with reject threshold &lt; 0.01 of assessor predictions for GPT-3, where 0 and 1 represent wrong and correct responses by the LM respectively. Predicted failure: 5114 actual failures, 226 actual corrects; predicted correct: 6004 actual failures, 13260 actual corrects.</p>
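      <p>The reject-rule accounting in the concrete example above boils down to a simple count over (assessor probability, LM score) pairs. The sketch below uses toy numbers and a hypothetical reject_stats helper, not the paper's actual data.</p>

```python
# Toy sketch of the anticipative reject rule: reject an instance (i.e., never
# run the LM on it) when the assessed success probability falls below a
# threshold, then count avoided failures and lost correct answers.

def reject_stats(assessor_probs, lm_scores, threshold):
    """assessor_probs: assessed success probabilities; lm_scores: 0/1 outcomes."""
    avoided_failures = 0  # LM failures we never pay compute for
    lost_corrects = 0     # correct answers we give up by rejecting
    n_rejected = 0
    for p, score in zip(assessor_probs, lm_scores):
        accepted = p >= threshold
        if accepted:
            continue
        n_rejected += 1
        if score == 1:
            lost_corrects += 1
        else:
            avoided_failures += 1
    return n_rejected, avoided_failures, lost_corrects

probs = [0.005, 0.002, 0.40, 0.95, 0.008, 0.90]
scores = [0, 0, 1, 1, 1, 1]
stats = reject_stats(probs, scores, 0.01)  # 3 rejected: 2 failures, 1 correct
```

      <p>The same counting over the real test set yields the confusion matrix of Table 4; since the assessor itself costs on the order of a thousand comparisons per instance, every rejected instance is compute the LM never spends.</p>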
      <p>In the feature importance study in Table 3, we can see that using either the system id or the number of parameters improves performance significantly, likely because both can indicate the scale of the system, which highly correlates with performance. The use of system id generalises slightly worse than #parameters. Other features, like #shots, prompt-template id, or the model family and sparsity indicators, have less effect on the performance (Q3). The assessor can easily derive the #shots from the input (more examples result in more features being present), so this makes sense. We did not measure any effect on aggregated performance from the different prompt-templates, and it is likely this feature is simply non-informative. Regarding model family and sparsity, we hypothesise that there is a large overlap between which instances the LMs solve correctly, so model architecture is not indicative of major performance differences (or the assessor fails to pick up on them).</p>
    </sec>
    <sec id="sec-4">
      <title>5. Conclusions and Future Work</title>
      <p>We have illustrated how a small assessor can manage performance expectations at a level that is comparable to the self-assessment of giant language models with billions of parameters. We have shown the assessor can be well calibrated and make refined predictions. We find that the assessor picks up on system features like id or #parameters that explain large variances in performance. We showcase how assessors can be used to reject instances before running much larger language models, resulting in a significant saving of compute.</p>
      <p>There are of course some limitations to this work. For example, the instance metafeatures are specific to the data-wrangling tasks used. Nonetheless, the positive results hint at future work. There are still many ways of directly improving the assessor we have used here. For instance, we could use post-hoc calibration with methods such as isotonic regression, or add instance weights to the results of systems we especially care about. There are also many questions to further investigate. Do assessors work for other tasks? Can we use a small LM instead of a random forest to allow free-form input? What is the agreement between different systems, and with the assessor?</p>
      <p>These future ideas could be useful from the perspective of saving computing costs as we outlined before, but the schema is of wider applicability. There is a lot of useful information generated during the evaluation process that is lost upon aggregation. Assessors are an attempt at capturing this information and providing expectation management that is external, fine-grained, anticipative, and can make use of population data. We could use them as instance-level model selectors, or we might be able to apply explainability techniques on the assessor to find out what makes an instance difficult. There is definitely more to explore around the topic of assessors, which perform granular assessments beyond generic aggregated results: saving compute by rejecting examples where the original model is going to fail is an important illustrative application.</p>
    </sec>
    <sec id="sec-5">
      <title>Acknowledgments</title>
      <p>We thank the anonymous reviewers for their comments. This work has been partially supported by the Norwegian Research Council grant 329745 Machine Teaching for Explainable AI, the EU (FEDER) and Spanish MINECO grant RTI2018-094403-B-C32 funded by MCIN/AEI/10.13039/501100011033 and by “ERDF A way of making Europe”, Generalitat Valenciana under grant PROMETEO/2019/098, the EU’s Horizon 2020 research and innovation programme under grant agreement No. 952215 (TAILOR), US DARPA HR00112120007 (RECoG-AI), INNEST/2021/317 (project cofunded by the European Union with the “Programa Operativo del Fondo Europeo de Desarrollo Regional (FEDER) de la Comunitat Valenciana 2014-2020”), and the UPV (Vicerrectorado de Investigación) grant PAI-10-21.</p>
    </sec>
    <sec id="sec-6">
      <title>References</title>
      <p>[1] J. Devlin, M.-W. Chang, K. Lee, K. Toutanova, BERT: Pre-training of deep bidirectional transformers for language understanding, arXiv preprint arXiv:1810.04805 (2018).</p>
      <p>[2] C. Raffel, N. Shazeer, A. Roberts, K. Lee, S. Narang, M. Matena, Y. Zhou, W. Li, P. J. Liu, Exploring the limits of transfer learning with a unified text-to-text transformer, arXiv preprint arXiv:1910.10683 (2019).</p>
      <p>[3] T. B. Brown, B. Mann, N. Ryder, M. Subbiah, J. Kaplan, P. Dhariwal, A. Neelakantan, P. Shyam, G. Sastry, A. Askell, et al., Language models are few-shot learners, in: Advances in Neural Information Processing Systems, volume 33, 2020, pp. 1877–1901.</p>
      <p>[4] E. Kharitonov, A. Lee, A. Polyak, Y. Adi, J. Copet, K. Lakhotia, T.-A. Nguyen, M. Rivière, A. Mohamed, E. Dupoux, W.-N. Hsu, Text-free prosody-aware generative spoken language modeling, arXiv:2109.03264 [cs, eess] (2021).</p>
      <p>[5] J. W. Rae, S. Borgeaud, T. Cai, K. Millican, J. Hoffmann, F. Song, J. Aslanides, S. Henderson, R. Ring, S. Young, et al., Scaling language models: Methods, analysis &amp; insights from training Gopher, arXiv preprint arXiv:2112.11446 (2021).</p>
      <p>[6] D. Hendrycks, C. Burns, S. Basart, A. Zou, M. Mazeika, D. Song, J. Steinhardt, Measuring massive multitask language understanding, arXiv preprint arXiv:2009.03300 (2020).</p>
      <p>[7] D. Hendrycks, S. Basart, S. Kadavath, M. Mazeika, A. Arora, E. Guo, C. Burns, S. Puranik, H. He, D. Song, J. Steinhardt, Measuring coding challenge competence with APPS, 2021. arXiv:2105.09938.</p>
      <p>[8] R. Bommasani, et al., On the opportunities and risks of foundation models, arXiv preprint arXiv:2108.07258 (2021).</p>
      <p>[9] L. Ouyang, J. Wu, X. Jiang, D. Almeida, C. L. Wainwright, P. Mishkin, C. Zhang, S. Agarwal, K. Slama, A. Ray, et al., Training language models to follow instructions with human feedback, arXiv preprint arXiv:2203.02155 (2022).</p>
      <p>[10] A. Srivastava, A. Rastogi, et al., Beyond the imitation game: Quantifying and extrapolating the capabilities of language models, 2022. URL: https://arxiv.org/abs/2206.04615. doi:10.48550/ARXIV.2206.04615.</p>
      <p>[11] R. Herbei, M. H. Wegkamp, Classification with reject option, The Canadian Journal of Statistics/La Revue Canadienne de Statistique (2006) 709–721.</p>
      <p>[12] F. Tortorella, An optimal reject rule for binary classifiers, in: Joint IAPR International Workshops on Statistical Techniques in Pattern Recognition (SPR) and Structural and Syntactic Pattern Recognition (SSPR), Springer, 2000, pp. 611–620.</p>
      <p>[13] K. Hendrickx, L. Perini, D. Van der Plas, W. Meert, J. Davis, Machine learning with a reject option: A survey, arXiv preprint arXiv:2107.11277 (2021).</p>
      <p>[14] Z. Jiang, J. Araki, H. Ding, G. Neubig, How can we know when language models know? On the calibration of language models for question answering, Transactions of the Association for Computational Linguistics 9 (2021) 962–977. doi:10.1162/tacl_a_00407.</p>
      <p>[15] J. Hernández-Orallo, W. Schellaert, F. Martínez-Plumed, Training on the test set: Mapping the system-problem space in AI, Proceedings of the AAAI Conference on Artificial Intelligence (2022).</p>
      <p>[16] S. Kandel, J. Heer, C. Plaisant, J. Kennedy, F. Van Ham, N. H. Riche, C. Weaver, B. Lee, D. Brod</p>
    </sec>
  </body>
  <back>
    <ref-list />
  </back>
</article>