<?xml version="1.0" encoding="UTF-8"?>
<TEI xml:space="preserve" xmlns="http://www.tei-c.org/ns/1.0" 
xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" 
xsi:schemaLocation="http://www.tei-c.org/ns/1.0 https://raw.githubusercontent.com/kermitt2/grobid/master/grobid-home/schemas/xsd/Grobid.xsd"
 xmlns:xlink="http://www.w3.org/1999/xlink">
	<teiHeader xml:lang="en">
		<fileDesc>
			<titleStmt>
				<title level="a" type="main">Reject Before You Run: Small Assessors Anticipate Big Language Models</title>
			</titleStmt>
			<publicationStmt>
				<publisher/>
				<availability status="unknown"><licence/></availability>
			</publicationStmt>
			<sourceDesc>
				<biblStruct>
					<analytic>
						<author>
							<persName><forename type="first">Lexin</forename><surname>Zhou</surname></persName>
							<affiliation key="aff0">
								<orgName type="department">VRAIN</orgName>
								<orgName type="laboratory">Valencian Research Institute for Artificial Intelligence</orgName>
								<orgName type="institution">Universitat Politècnica de València</orgName>
							</affiliation>
						</author>
						<author>
							<persName><forename type="first">Fernando</forename><surname>Martínez-Plumed</surname></persName>
							<affiliation key="aff0">
								<orgName type="department">VRAIN</orgName>
								<orgName type="laboratory">Valencian Research Institute for Artificial Intelligence</orgName>
								<orgName type="institution">Universitat Politècnica de València</orgName>
							</affiliation>
						</author>
						<author>
							<persName><forename type="first">José</forename><surname>Hernández-Orallo</surname></persName>
							<affiliation key="aff0">
								<orgName type="department">VRAIN</orgName>
								<orgName type="laboratory">Valencian Research Institute for Artificial Intelligence</orgName>
								<orgName type="institution">Universitat Politècnica de València</orgName>
							</affiliation>
							<affiliation key="aff1">
								<orgName type="department">Leverhulme Centre for the Future of Intelligence</orgName>
								<orgName type="institution">University of Cambridge</orgName>
							</affiliation>
						</author>
						<author>
							<persName><forename type="first">Cèsar</forename><surname>Ferri</surname></persName>
							<affiliation key="aff0">
								<orgName type="department">VRAIN</orgName>
								<orgName type="laboratory">Valencian Research Institute for Artificial Intelligence</orgName>
								<orgName type="institution">Universitat Politècnica de València</orgName>
							</affiliation>
						</author>
						<author>
							<persName><forename type="first">Wout</forename><surname>Schellaert</surname></persName>
							<affiliation key="aff0">
								<orgName type="department">VRAIN</orgName>
								<orgName type="laboratory">Valencian Research Institute for Artificial Intelligence</orgName>
								<orgName type="institution">Universitat Politècnica de València</orgName>
							</affiliation>
						</author>
						<title level="a" type="main">Reject Before You Run: Small Assessors Anticipate Big Language Models</title>
					</analytic>
					<monogr>
						<idno type="ISSN">1613-0073</idno>
					</monogr>
					<idno type="MD5">73AC8C695AF70B0FF0D22609B939817E</idno>
				</biblStruct>
			</sourceDesc>
		</fileDesc>
		<encodingDesc>
			<appInfo>
				<application version="0.7.2" ident="GROBID" when="2023-03-24T08:32+0000">
					<desc>GROBID - A machine learning software for extracting information from scholarly documents</desc>
					<ref target="https://github.com/kermitt2/grobid"/>
				</application>
			</appInfo>
		</encodingDesc>
		<profileDesc>
			<textClass>
				<keywords>
					<term>Assessor</term>
					<term>Anticipative Reject Option</term>
					<term>Language Model</term>
					<term>Data Wrangling</term>
					<term>AI Evaluation</term>
					<term>Instance Granularity</term>
				</keywords>
			</textClass>
			<abstract>
<div xmlns="http://www.tei-c.org/ns/1.0"><p>Large Language Models (LMs) are expensive to operate. It would be more frugal to avoid querying them when results are predictably bad. In this paper we therefore investigate whether it is possible to granularly predict the performance of these large LMs with a much smaller external model, the assessor, which is trained on evaluation results. For instance, given an input prompt, can an assessor estimate the probability of correct completion by a giant like GPT-3 Davinci (175B parameters)? Using a data-wrangling task included in the BIG-bench repository as a case study, we find it is indeed possible, and we report results that are comparable in accuracy and calibration to the LM itself. This suggests that, at least for some tasks, a lot of compute, money, and emissions could be spared through the assessor's anticipative reject option. It also suggests that assessors can capture meaningful extra information from the evaluation procedure, and as such, could be a useful complement to simple aggregate metrics.</p></div>
			</abstract>
		</profileDesc>
	</teiHeader>
	<text xml:lang="en">
		<body>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="1.">Introduction</head><p>Extensive experimental research on Language Models (LMs) keeps showing remarkable results across several domains, including mathematics, question answering, language understanding, and code generation <ref type="bibr" target="#b0">[1,</ref><ref type="bibr" target="#b1">2,</ref><ref type="bibr" target="#b2">3,</ref><ref type="bibr" target="#b3">4,</ref><ref type="bibr" target="#b4">5,</ref><ref type="bibr" target="#b5">6,</ref><ref type="bibr" target="#b6">7,</ref><ref type="bibr" target="#b7">8,</ref><ref type="bibr" target="#b8">9]</ref>. While the performance results for many tasks are quickly improving (on average), there is a high variance in the results depending on the particular task, the instances, and the prompts <ref type="bibr" target="#b9">[10]</ref>. For a given task, one can partially deal with the variability across instances through a traditional reject rule, where we abstain from using the model's decision when the probability of its answer (i.e., its "confidence") falls below a certain threshold <ref type="bibr" target="#b10">[11,</ref><ref type="bibr" target="#b11">12,</ref><ref type="bibr" target="#b12">13]</ref>. This requires a good calibration of the model. However, even if LMs were well calibrated, and they generally are not <ref type="bibr" target="#b9">[10,</ref><ref type="bibr" target="#b13">14]</ref>, it would still require actually running the inference. For large LMs this comes at a non-negligible cost per token, either in required infrastructure or through the price of the API. To avoid being wasteful, we explore how much we can anticipate the level of success for a particular instance (or a collection of instances) without running it through the LM at all.</p><note>EBeM '22: Workshop on AI Evaluation Beyond Metrics, July 25, 2022, Vienna, Austria. lzhou@inf.upv.es (L. Zhou); fermarpl@dsic.upv.es (F. Martínez-Plumed); jorallo@dsic.upv.es (J. Hernández-Orallo); cferri@dsic.upv.es (C. Ferri); wschell@vrain.upv.es (W. Schellaert). https://lexzhou.github.io/ (L. Zhou); https://nandomp.github.io/ (F. Martínez-Plumed); http://josephorallo.webs.upv.es/ (J. Hernández-Orallo); http://personales.upv.es/ceferra/ (C. Ferri); https://schellaert.org/ (W. Schellaert). ORCID: 0000-0003-1161-4270 (L. Zhou); 0000-0003-2902-6477 (F. Martínez-Plumed); 0000-0001-9746-7632 (J. Hernández-Orallo); 0000-0002-8975-1120 (C. Ferri); 0000-0002-9182-4747 (W. Schellaert).</note><p>For this, we need an assessor: an external conditional probability (or density) estimator that can reliably predict beforehand the performance of an LM at instance granularity <ref type="bibr" target="#b14">[15]</ref>. With a good assessor, we can calculate whether it is actually worth asking the LM for an answer, depending on factors such as the value of a correct result, the cost of running the model, and, of course, the performance estimated by the assessor. See Figure <ref type="figure" target="#fig_0">1</ref> for an illustrative example.</p><p>This paper describes a (successful) attempt at building such an assessor for a collection of large LMs consisting of various scales of GPT-3 <ref type="bibr" target="#b2">[3]</ref> and BIG-G <ref type="bibr" target="#b9">[10]</ref> models, applied to a diverse set of data-wrangling tasks. Data wrangling <ref type="bibr" target="#b15">[16,</ref><ref type="bibr" target="#b16">17]</ref> is a notoriously time-consuming data preparation chore where LMs have recently shown promising results <ref type="bibr" target="#b17">[18]</ref>. 
We discuss data-wrangling and the considerations regarding the use of LMs in sections 2.1 and 2.2.</p><p>To our knowledge, this is the first paper analysing assessors applied to the language domain, and to a plausible use case in general. Additionally,</p><p>• We find that lightweight assessors can give reliable instance-level predictions of the performance of large LMs.</p><p>• We find that their predictions are well-calibrated and unbiased, again comparable to the self-assessment of the LMs.</p><p>• We investigate the contributions of various features like #shots and #parameters to assessor performance.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head>Contributions</head></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="2.">Background</head><p>In this section we revisit some key ideas of LMs, their costs, their applications to the data wrangling problem, and the traditional (post-hoc) reject option. We also summarise the main elements of the recently introduced concept of assessor models.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="2.1.">(Large) Language Models</head><p>In less than a decade, research in Natural Language Processing (NLP) has been transformed by the appearance of a suite of LMs trained in an unsupervised manner on very large corpora. LMs capture more and more of the information in natural language, including the linguistic characteristics of various human languages and associated knowledge. Moreover, these models can be adapted (e.g., through fine-tuning) to a wide range of downstream tasks <ref type="bibr" target="#b7">[8]</ref>. Recent LMs such as GPT-3 <ref type="bibr" target="#b2">[3]</ref>, PanGu-𝛼 <ref type="bibr" target="#b18">[19]</ref>, GLaM <ref type="bibr" target="#b19">[20]</ref> and OPT <ref type="bibr" target="#b20">[21]</ref> have excelled at few-shot inference, where a task is solved by supplying a small set of correct examples formatted as a prompt. The quality of the completion usually depends on the number of supplied examples: for instance, 5-shot inference is usually better than 2-shot inference, but requires more effort from the user. However, on many occasions the cost of running LMs is not negligible, in both computational <ref type="bibr" target="#b21">[22,</ref><ref type="bibr" target="#b22">23]</ref> and economic terms <ref type="bibr" target="#b23">[24]</ref>. Large LMs, open source or not, all have steep development costs in common: a recent study <ref type="bibr" target="#b23">[24]</ref> puts the cost of developing an LM with only 1.5 billion parameters at $1.6 million. Inference cost is another drain: <ref type="bibr" target="#b24">[25]</ref> estimates the cost of running GPT-3 in the cloud at a minimum of $87,000 per year, with the current API price for Davinci being 6 cents per 750 words<ref type="foot" target="#foot_0">1</ref>. 
Of course, these costs go down as compute becomes cheaper, but larger models are expected to quickly replace the old ones to set the new state of the art. Also, as LMs improve, their penetration rates will increase, making them widespread in billions of semi-automated operations in many domains, and compute might easily become more of an issue, not less.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="2.2.">Data Wrangling</head><p>Data-wrangling <ref type="bibr" target="#b15">[16,</ref><ref type="bibr" target="#b16">17]</ref> is a data preparation task that data janitors, data scientists and other people operating with forms, spreadsheets and other data formatting situations consider a very monotonous and laborious part of their jobs. Data wrangling can require as much as 80 percent of their time <ref type="bibr" target="#b25">[26]</ref>, including tediously transforming data presented in heterogeneous formats into a standardised format for efficient access, understanding, and analysis. One of the challenges in data-wrangling automation consists of selecting the correct (string) transformations from the vast set of possible ones, and doing so having seen only a few examples <ref type="bibr" target="#b26">[27]</ref>. Many approaches have attempted to address this challenge by reducing the transformation space through the incorporation of prior knowledge <ref type="bibr" target="#b27">[28,</ref><ref type="bibr" target="#b28">29]</ref>. This has led to many tools that use domain-specific languages or need ad-hoc solutions <ref type="bibr" target="#b29">[30]</ref>.</p><p>Because LMs capture vast amounts of human knowledge across many different domains, they can be especially effective for more open-ended tasks such as data wrangling, a role recognised in data science automation <ref type="bibr" target="#b30">[31]</ref>. Using few-shot inference <ref type="bibr" target="#b2">[3,</ref><ref type="bibr" target="#b31">32,</ref><ref type="bibr" target="#b32">33]</ref>, LMs have shown promising yet unreliable results for data wrangling. For instance, in <ref type="bibr" target="#b17">[18]</ref> GPT-3 Davinci (prompted, not fine-tuned) achieves 56% accuracy in the 1-shot setting, 68% in the 4-shot setting, and almost 90% with 10 shots. 
Additionally, as opposed to LM results on other tasks, GPT-3 is also relatively well calibrated in the data-wrangling task, reporting a Brier score of 0.11 (see section 4).</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="2.3.">Reject option</head><p>Given these unreliable accuracies but good calibration scores, we could make a more reliable and effective use of these systems by not using those outputs for which the confidence of the system is low. In other words, if we know for which instances the LM is (likely to be) wrong, we can abstain from using the output of the LM in these cases. This is called a 'reject option', and a classic and straightforward implementation of it is to use a confidence threshold 𝑡 and compare it with the probability 𝑝( ŷ |𝑥) that a model 𝑝 assigns to its output ŷ . This represents the self-assigned probability of being correct (i.e., its confidence) <ref type="bibr" target="#b10">[11,</ref><ref type="bibr" target="#b11">12,</ref><ref type="bibr" target="#b12">13]</ref>. We set 𝑡 to match the error tolerance of the use case, and when 𝑝( ŷ |𝑥) &lt; 𝑡 , we do not use the output of the model, usually delegating to a human. However, this classical interpretation of the reject rule still requires running the model. As mentioned before, this can be expensive for large LMs. Whenever the reject rule triggers, it is not only that humans need to do the task manually, but we have also incurred a cost in the computation of a model that is effectively wasted.</p></div>
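The classical post-hoc reject rule above can be sketched in a few lines. This is an illustrative sketch, not code from the paper; the function names and the batch interface are assumptions.

```python
def should_reject(confidence: float, threshold: float) -> bool:
    """Classical reject rule: abstain when the model's self-assigned
    probability of being correct, p(y_hat|x), falls below the threshold t."""
    return confidence < threshold

def filter_outputs(outputs, confidences, threshold=0.5):
    """Keep only outputs whose confidence clears the threshold;
    rejected instances would be delegated to a human."""
    kept, delegated = [], []
    for out, conf in zip(outputs, confidences):
        (delegated if should_reject(conf, threshold) else kept).append(out)
    return kept, delegated
```

Note that both functions consume the model's own confidence, so the expensive inference has already happened by the time the rule fires, which is exactly the waste the anticipative variant avoids.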
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="2.4.">Assessors</head><p>Assessor models <ref type="bibr" target="#b14">[15]</ref> provide an external anticipative reject option instead. Assessors are conditional probability (or density) estimators R (𝑟|𝜋, 𝜇) that are trained on evaluation data. With 'evaluation data' we mean a set of evaluation records ⟨𝜋, 𝜇, 𝑟⟩, where 𝜋 refers to a profile or description of a particular system (e.g., deployment conditions, state, system architecture, or hyperparameters), 𝜇 refers to a particular instance (e.g., a prompt), and 𝑟 to an empirical measurement of the performance of 𝜋 on 𝜇.</p><p>Assessor models are meant to act as general mappings between the space of systems, the space of instances, and the corresponding distribution of scores. They are a way of capturing all available evaluation information in a single predictive model that could be used, e.g., to investigate what features make an instance difficult, to add confidence capabilities to systems that do not have them, or to select the optimal model for a specific instance. In this case, we focus on their use to provide an anticipative reject option: when R is built and shown to be an accurate estimator, we can use it to make inferences on the expected performance R (𝑟 = 1|𝜋, 𝜇) given a system 𝜋 and instance 𝜇 (or a collection of those).</p><p>We do still have to run actual inference on the assessor, but as we show in the experiments, they have the possibility of being multiple orders of magnitude smaller than the LMs, allowing us to cheaply avoid any LM inference that is doomed to fail.</p></div>
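The anticipative decision the assessor enables can be framed as a simple expected-utility comparison. The formula below is a sketch under our own assumptions (the paper does not fix a specific decision rule); the value and cost parameters are illustrative.

```python
def worth_querying(p_success: float, value_correct: float, cost_inference: float) -> bool:
    """Anticipative decision: query the LM only when the expected value of a
    correct answer, as estimated by the assessor R(r=1|pi, mu), exceeds the
    cost of running the LM on this instance."""
    return p_success * value_correct > cost_inference
```

For example, with a correct answer worth 10 units and an inference cost of 1 unit, any instance the assessor scores above a 10% success probability is worth sending to the LM.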
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="3.">Methods</head><p>In this section we identify the experimental setting, including goals of the analysis, the data sources and how they are converted into evaluation records, and how we build the assessor from them.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="3.1.">Experimental Questions</head><p>We set three experimental questions:</p><p>Q1: Can we build lightweight yet good assessors for language models in this domain?</p><p>Q2: Are the assessors of comparable quality to the language models when estimating probabilities?</p><p>Q3: What features from the systems and the instances are most relevant for predicting success and consequently for building good assessors?</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="3.2.">Data Sources and Train-Test Split</head><p>We work with the Data Wrangling Dataset Repository<ref type="foot" target="#foot_1">2</ref> , containing 119 tasks from 7 domains (dates, emails, free text, names, phones, times, and units). In particular, we use results (at instance level) from multiple LMs obtained from two different evaluation efforts. First, <ref type="bibr" target="#b33">[34]</ref> have produced granular results of the evaluation of different versions of GPT-3. We have 146k instances available for GPT-3 models Ada (350M), Babbage (1.3B), Curie (6.7B), and Davinci (175B), from 0-shot to 10-shot. More information about the architectures can be found in <ref type="bibr" target="#b2">[3]</ref>.</p><p>Second, <ref type="bibr" target="#b9">[10]</ref> provides results on the same benchmark for a collection of Google LMs of various parameter sizes.</p><p>Here we extract 86k instances, from 0-shot to 3-shot, for 22 models with parameter sizes ranging from 2M to 128B across two different model families, a decoder-only dense transformer (BIG-G dense) and a sparse Mixture-of-Experts <ref type="bibr" target="#b34">[35]</ref> model (BIG-G sparse). More information on the BIG-G network architectures is available in <ref type="bibr" target="#b9">[10]</ref>.</p><p>All models (GPT-3 and BIG-G variants) were queried with temperature set to 0, and none of them were fine-tuned for the data-wrangling task.</p><p>As the assessor is trained on a somewhat heterogeneous collection of systems and instances, we have to be careful to define a train-test partition of the evaluation results without contamination or information leakage. To this purpose, we must ensure that the same instances are consistently used across systems and shots. For example, we have to avoid that the result of BIG-G dense with 2-shots on instance 𝑖 is in the training set, while GPT-3 Ada's result with 0-shots on the same instance is in the test set. 
Figure <ref type="figure" target="#fig_1">2</ref> shows a visual representation of the partition requirements that ensure that this does not happen.</p></div>
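The partition requirement amounts to splitting by instance rather than by record, so that every result for a given instance (across systems and #shots) lands on the same side. A minimal sketch, where the record layout and field names are our own assumptions:

```python
import random

def split_by_instance(records, test_fraction=0.2, seed=0):
    """Train-test split without leakage: all evaluation records for the
    same instance (across systems and #shots) end up on the same side."""
    instance_ids = sorted({r["instance_id"] for r in records})
    rng = random.Random(seed)
    rng.shuffle(instance_ids)
    n_test = max(1, int(len(instance_ids) * test_fraction))
    test_ids = set(instance_ids[:n_test])
    train = [r for r in records if r["instance_id"] not in test_ids]
    test = [r for r in records if r["instance_id"] in test_ids]
    return train, test
```

Shuffling instance ids rather than records is the whole trick: a record-level split would scatter results for the same prompt across both sides.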
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="3.3.">Anatomy of the Evaluation Record</head><p>From our two data sources, we receive records of the shape ⟨system id, #shots, prompt, score⟩. We further annotate this record with features describing the system (𝜋), and extract meta features of the instance (𝜇) that are fit for tabular representation (as opposed to free form text). In the end, this creates a general record of the shape ⟨𝜋, 𝜇, 𝑟⟩ = ⟨⟨system features⟩, ⟨instance features⟩, score⟩. We describe these features in detail below, but ultimately the only constraint for making a useful assessor is that all system and instance features are available without actually running the original model.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="3.3.1.">System features</head><p>The available system features include a system id that refers to a specific trained LM, i.e., a set of learned parameters fitting a certain architecture, the id of that architecture (either GPT-3, BIG-G sparse, or BIG-G dense), whether a model is dense or sparse, and the number of parameters. These features will of course be the same for all records of the same trained model.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="3.3.2.">Instance features</head><p>Instance features include the number of shots, the id of the prompt-template <ref type="foot" target="#foot_2">3</ref> , and 54 simple binary metafeatures that can be automatically extracted through regular expressions from the original text. Examples include the kind of symbols the instance contains (e.g., numbers, dots, dashes) or whether it starts with a digit (see Figure 3).</p></div>
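As a concrete illustration, a handful of such binary metafeatures can be computed with regular expressions. These five are examples in the spirit of the 54 used here, not the actual feature set, and the names are our own:

```python
import re

# Illustrative binary metafeatures over an instance string.
METAFEATURES = {
    "has_digit": lambda s: bool(re.search(r"\d", s)),
    "has_dot": lambda s: "." in s,
    "has_dash": lambda s: "-" in s,
    "has_at_symbol": lambda s: "@" in s,
    "starts_with_digit": lambda s: bool(re.match(r"\d", s)),
}

def extract_metafeatures(text: str) -> dict:
    """Map an instance string to a row of 0/1 metafeatures fit for
    tabular representation."""
    return {name: int(check(text)) for name, check in METAFEATURES.items()}
```

A date-like string such as "25-03-2022" would light up the digit and dash features, while an email would light up the at-symbol and dot features, which is how the assessor can tell domains apart without seeing free-form text.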
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="3.3.3.">Score</head><p>For the data wrangling tasks, all scores are binary: 1 if the output of the LM matches the target string exactly, and 0 otherwise. The score is what the assessors must predict, and thus acts as a label during training.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="3.4.">Assessor Building and Evaluation</head><p>For the assessor model, we train a Random Forest <ref type="bibr" target="#b35">[36]</ref> with 100 decision trees, a minimum node size of 5, and 50% of the available variables randomly selected at each split, tuned through grid search on a validation set using an 84%-16% training-validation split<ref type="foot" target="#foot_3">4</ref> of the training set defined previously. For the remaining hyperparameters the defaults were used<ref type="foot" target="#foot_4">5</ref>. We report the Area Under the Receiver Operating Characteristic Curve (AUROC) and Brier Score (BS), as well as its decomposition into calibration and refinement loss <ref type="bibr" target="#b36">[37,</ref><ref type="bibr" target="#b37">38,</ref><ref type="bibr" target="#b38">39]</ref>. As a baseline to compare assessors to, we take the standard approach of interpreting the probability 𝑝( ŷ |𝑥) the LM assigns to its output ŷ as the "confidence" of the model, i.e., its self-assessed probability of being correct. However, there is no data 𝑝( ŷ |𝑥) recorded in the BIG-bench logs, so we cannot compare the assessor AUROC or BS to those of the BIG-G family of models. For GPT-3 this information is available.</p></div>
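The Brier score and its split into calibration and refinement loss can be approximated by binning the predicted probabilities. A minimal pure-Python sketch; the equal-width binning scheme is our assumption, as the exact decomposition procedure is not spelled out here:

```python
def brier_decomposition(probs, labels, n_bins=10):
    """Brier score with an approximate decomposition into calibration loss
    (how far mean confidence sits from mean accuracy per bin) and refinement
    loss (how mixed the labels are within each bin); BS is close to CAL + REF."""
    n = len(probs)
    bs = sum((p - y) ** 2 for p, y in zip(probs, labels)) / n
    buckets = {}
    for p, y in zip(probs, labels):
        b = min(int(p * n_bins), n_bins - 1)  # equal-width probability bins
        buckets.setdefault(b, []).append((p, y))
    cal = ref = 0.0
    for pairs in buckets.values():
        w = len(pairs) / n
        p_bar = sum(p for p, _ in pairs) / len(pairs)
        y_bar = sum(y for _, y in pairs) / len(pairs)
        cal += w * (p_bar - y_bar) ** 2
        ref += w * y_bar * (1 - y_bar)
    return bs, cal, ref
```

A perfectly calibrated but uninformative predictor (always 0.5 on balanced labels) gets all of its 0.25 Brier score from the refinement term, which is why a low CAL alongside a high REF, as in Tables 1 and 2, points at post-hoc calibration having limited headroom.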
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="4.">Results and Discussion</head><p>Since it is assumed that all models give better results with 𝑛 + 1 shots than with 𝑛 shots, Figure <ref type="figure" target="#fig_3">4</ref> shows the accuracies of the LMs with the maximum number of shots available. Table <ref type="table">1</ref> describes the AUROC and BS, decomposed into calibration loss (CAL) and refinement loss (REF), for the GPT-3 data given by an assessor trained with all features available except for the prompt-template id, along with GPT-3's self-assessment on the same set of instances. Table <ref type="table">2</ref> describes the AUROC and BS, decomposed into calibration loss (CAL) and refinement loss (REF), for the BIG-G data given by the assessor. Finally, Table <ref type="table" target="#tab_2">3</ref> shows the impact of various system and instance features on the performance of the assessor.</p><p>Table <ref type="table">1</ref>: AUROC and BS (Calibration, Refinement) on the GPT-3 data, for a single assessor trained with both GPT-3 and BIG-G data using all available features (except prompt-template id), alongside the self-estimation from the GPT-3 LMs. In the bottom row, AUROC and BS are not averaged, but calculated from the aggregated set of instances. The average accuracies of the GPT-3 LMs (with std. dev. across #shots) on the original data-wrangling task are also presented, and serve as an indication of the class distribution the assessor has to deal with.</p><p>Table <ref type="table">2</ref>: AUROC and BS (Calibration, Refinement) for the BIG-G data, using a single assessor trained with both GPT-3 and BIG-G data using all available features (except prompt-template id). In the bottom row, AUROC and BS are not averaged, but calculated from the aggregated set of instances. The average accuracies of the LMs (with std. dev. across #shots) on the original data-wrangling task are also presented, and serve as an indication of the class distribution the assessor has to deal with.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><p>Analysing the results in Table <ref type="table">1</ref> and Table <ref type="table">2</ref>, we see relatively good overall assessor performance, with AUROCs of around 0.9 and BSs around 0.12 (Q1). It should be noted that the metrics for the smallest LMs have to be interpreted cautiously due to the significant imbalance in the distribution of LM scores (i.e., for very low accuracies it is easy to predict that the LM will usually fail). We also see that, in general, performance is worse for the BIG-G models than for the GPT-3 ones. This could be due to GPT-3 being more predictable, or to the availability of more data for GPT-3 (possibly making the assessor pay more attention to the majority model family in its generalisation). This observation suggests that the distribution of results of each system affects the performance of the assessor accordingly. If we wanted to focus on building an assessor for a specific LM, techniques like instance weights or oversampling could help. Comparing these results with GPT-3's self-assessment in Table <ref type="table">1</ref>, we can conclude that the assessor performs slightly worse than GPT-3, but is definitely comparable (Q2). A significant part of the difference in BS comes from the calibration (CAL) term, and not from the refinement (REF) term, which is very similar for the LMs and the assessor, especially for the smaller versions of GPT-3. This suggests that post-hoc calibration methods <ref type="bibr" target="#b39">[40]</ref> like isotonic regression could still improve results significantly.</p><p>In the feature importance study in Table <ref type="table" target="#tab_2">3</ref>, we can see that using either the system id or the number of parameters improves performance significantly, likely because both can indicate the scale of the system, which highly correlates with performance. The use of system id generalises slightly worse than #parameters. Other features, like #shots, prompt-template id, or the model family and sparsity indicators, have less effect on performance (Q3). The assessor can easily derive the #shots from the input (more examples result in more features being present), so this makes sense. We did not measure any effect on aggregated performance from the different prompt-templates, and it is likely this feature is simply non-informative. Regarding model family and sparsity, we hypothesise that there is a large overlap between which instances the LMs solve correctly, so model architecture is not indicative of major performance differences (or the assessor fails to pick up on them).</p><p>Finally, we discuss a concrete example of using the assessor to implement a reject rule (see Table 4). For the GPT-3 data in the test set (24604 instances), we take a reject threshold of 1%, i.e., we reject instances where the assessor deems it less than 1% likely that the LM would succeed. The assessor rejects about 5340 instances, which account for 21.7% of the instances and (approximately) of the total compute. Of these 5340, 5114 are correctly rejected, representing 46% of the failures, at the cost of only 226 correct answers being rejected (about 1.5%).</p><p>Therefore, a lot of compute, money, and emissions would be saved, since the assessor is far smaller than the LMs in terms of parameters and inference time. Concretely, the proposed assessor has 100 decision trees of (approximately) 20000 nodes each, whose inference time is on the order of 100 ⋅ log 2 (20000) ≈ 1450 comparisons, far fewer than what the LMs require for a single pass through their billions of parameters.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head>Table 4</head><p>Confusion matrix with reject threshold &lt; 0.01 of assessor predictions for GPT-3. The 0 and 1 represent wrong and correct responses by the LM respectively. </p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="5.">Conclusions and Future work</head><p>We have illustrated how a small assessor can manage performance expectations at a level that is comparable to the self-assessment of giant language models with billions of parameters. We have shown the assessor can be well calibrated and make refined predictions. We find that the assessor picks up on system features like the system id or #parameters that explain large variances in performance. We showcase how they can be used to reject instances before running much larger language models, resulting in a significant saving of compute. There are of course some limitations to this work. For example, the instance metafeatures are specific to the data-wrangling tasks used. Nonetheless, the positive results hint at future work. There are still many ways of directly improving the assessor we have used here. For instance, we could use post-hoc calibration with methods such as isotonic regression, or add instance weights to the results of systems we especially care about. There are also many questions to further investigate. Do assessors work for other tasks? Can we use a small LM instead of a random forest to allow free-form input? What is the agreement between different systems, and with the assessor?</p><p>These future ideas could be useful from the perspective of saving computing costs as we outlined before, but the schema is of wider applicability. There is a lot of useful information generated during the evaluation process that is lost upon aggregation. Assessors are an attempt at capturing this information and providing expectation management that is external, fine-grained, anticipative, and can make use of population data. 
We could use them as instance-level model selectors, or we might be able to apply explainability techniques on the assessor to find out what makes an instance difficult.</p><p>There is definitely more to explore around the topic of assessors, which perform granular assessments beyond generic aggregated results: saving compute by rejecting examples where the original model is going to fail is an important illustrative application.</p></div><figure xmlns="http://www.tei-c.org/ns/1.0" xml:id="fig_0"><head>Figure 1 :</head><label>1</label><figDesc>Figure 1: (Top) Process of a LM generating the solution for a date transformation (repetitive) problem in a spreadsheet. Once the user prompts one instance of the desired transformation (row 1), the LM proceeds to transforming the rest of instances (rows 2 &amp; onward). (Bottom) Process of an assessor that can reliably predict beforehand the performance of the LM at the instance-level.</figDesc></figure>
<figure xmlns="http://www.tei-c.org/ns/1.0" xml:id="fig_1"><head>Figure 2 :</head><label>2</label><figDesc>Figure 2: Illustration of the matching requirements for making a train-test partition for the assessor. Each column represents a data wrangling prompt used to evaluate a LM. Orange columns represent instances included in the training set for the assessor, while green represents those included in the test set for the assessor. To avoid contamination, the same instances should be used across different shots and systems.</figDesc><graphic coords="4,89.52,84.19,202.92,93.02" type="bitmap" /></figure>
<figure xmlns="http://www.tei-c.org/ns/1.0" xml:id="fig_2"><head>Figure 3 :</head><label>3</label><figDesc>Figure 3: Example of metafeatures that can be extracted from the examples of different domains (dates and emails in the figure). Adapted from [29].</figDesc></figure>
<figure xmlns="http://www.tei-c.org/ns/1.0" xml:id="fig_3"><head>Figure 4 :</head><label>4</label><figDesc>Figure 4: LMs' accuracies per LM and size on the original data-wrangling task. Logarithmic scale used on the 𝑥-axis.</figDesc></figure>
<figure xmlns="http://www.tei-c.org/ns/1.0" type="table" xml:id="tab_2"><head>Table 3</head><label>3</label><figDesc>Ablation study of the impact of various features on assessor performance. The 54 instance metafeatures are always included. Row 4, in italics, indicates the assessor reported in Table 1 and Table 2.</figDesc><table><row><cell></cell><cell cols="4">system id, # parameters, template id, fam. &amp; spars., # shots</cell><cell>AUROC ( R )</cell><cell>BS (CAL, REF) ( R )</cell></row><row><cell>1</cell><cell>•</cell><cell></cell><cell></cell><cell></cell><cell>0.909</cell><cell>0.130 (0.015, 0.115)</cell></row><row><cell>2</cell><cell>•</cell><cell></cell><cell></cell><cell>•</cell><cell>0.910</cell><cell>0.130 (0.015, 0.115)</cell></row><row><cell>3</cell><cell>•</cell><cell></cell><cell>•</cell><cell>•</cell><cell>0.912</cell><cell>0.127 (0.014, 0.113)</cell></row><row><cell>4</cell><cell>•</cell><cell>•</cell><cell>•</cell><cell>•</cell><cell>0.916</cell><cell>0.128 (0.017, 0.111)</cell></row><row><cell>5</cell><cell></cell><cell>•</cell><cell>•</cell><cell>•</cell><cell>0.917</cell><cell>0.126 (0.015, 0.111)</cell></row><row><cell>6</cell><cell></cell><cell>•</cell><cell></cell><cell>•</cell><cell>0.916</cell><cell>0.126 (0.015, 0.111)</cell></row><row><cell>7</cell><cell></cell><cell>•</cell><cell>•</cell><cell>•</cell><cell>0.916</cell><cell>0.128 (0.016, 0.112)</cell></row><row><cell>8</cell><cell></cell><cell></cell><cell>•</cell><cell>•</cell><cell>0.869</cell><cell>0.167 (0.030, 0.137)</cell></row><row><cell>9</cell><cell></cell><cell></cell><cell>•</cell><cell>•</cell><cell>0.868</cell><cell>0.168 (0.031, 0.137)</cell></row><row><cell>10</cell><cell></cell><cell></cell><cell></cell><cell>•</cell><cell>0.865</cell><cell>0.170 (0.033, 0.137)</cell></row></table></figure>
			<note xmlns="http://www.tei-c.org/ns/1.0" place="foot" n="1" xml:id="foot_0">https://openai.com/api/pricing/</note>
			<note xmlns="http://www.tei-c.org/ns/1.0" place="foot" n="2" xml:id="foot_1">http://dmip.webs.upv.es/datawrangling/</note>
			<note xmlns="http://www.tei-c.org/ns/1.0" place="foot" n="3" xml:id="foot_2">The prompt-template differs between <ref type="bibr" target="#b9">[10]</ref> and <ref type="bibr" target="#b33">[34]</ref>, but the same metafeatures can be extracted (see Figure 3 for an example). We refer to <ref type="bibr" target="#b28">[29]</ref> for an overview. The binary metafeatures are available for every input and output in the prompt, so, for example, for a 2-shot prompt we would have 2 inputs and 2 outputs from the examples, and 1 input for the actual question, totalling 5 ⋅ 54 = 270 features.</note>
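The feature-count arithmetic in the footnote above generalises directly. As an illustrative sketch (the function name is ours, not the paper's): each prompt field, i.e. the k shot inputs, the k shot outputs, and the one query input, contributes 54 binary metafeatures, giving (2k + 1) · 54 features for a k-shot prompt.

```python
def n_assessor_features(n_shots: int, n_metafeatures: int = 54) -> int:
    """Total metafeatures for a k-shot prompt: (2k + 1) fields, 54 features each."""
    return (2 * n_shots + 1) * n_metafeatures

# The footnote's 2-shot case: 2 inputs + 2 outputs + 1 query input = 5 fields.
assert n_assessor_features(2) == 5 * 54 == 270
```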
			<note xmlns="http://www.tei-c.org/ns/1.0" place="foot" n="4" xml:id="foot_3">The non-standard train-test partition is the result of the instance matching procedure described in section 3.2.</note>
			<note xmlns="http://www.tei-c.org/ns/1.0" place="foot" n="5" xml:id="foot_4">The RandomForest package (https://cran.r-project.org/web/packages/randomForest/index.html) was used for training the assessor model.</note>
		</body>
		<back>

			<div type="acknowledgement">
<div xmlns="http://www.tei-c.org/ns/1.0"><head>Acknowledgments</head><p>We thank the anonymous reviewers for their comments. This work has been partially supported by the Norwegian Research Council grant 329745 Machine Teaching for Explainable AI, also by the EU (FEDER) and Spanish MINECO grant RTI2018-094403-B-C32 funded by MCIN/AEI/10.13039/501100011033 and by "ERDF A way of making Europe", Generalitat Valenciana under grant PROMETEO/2019/098, EU's Horizon 2020 research and innovation programme under grant agreement No. 952215 (TAILOR), US DARPA HR00112120007 (RECoG-AI), and INNEST/2021/317 (Project cofunded by the European Union with the "Programa Operativo del Fondo Europeo de Desarrollo Regional (FEDER) de la Comunitat Valenciana 2014-2020") and "the UPV (Vicerrectorado de Investigación) grant PAI-10-21".</p></div>
			</div>

			<div type="references">

				<listBibl>

<biblStruct xml:id="b0">
	<monogr>
		<author>
			<persName><forename type="first">J</forename><surname>Devlin</surname></persName>
		</author>
		<author>
			<persName><forename type="first">M.-W</forename><surname>Chang</surname></persName>
		</author>
		<author>
			<persName><forename type="first">K</forename><surname>Lee</surname></persName>
		</author>
		<author>
			<persName><forename type="first">K</forename><surname>Toutanova</surname></persName>
		</author>
		<idno type="arXiv">arXiv:1810.04805</idno>
		<title level="m">BERT: Pre-training of deep bidirectional transformers for language understanding</title>
				<imprint>
			<date type="published" when="2018">2018</date>
		</imprint>
	</monogr>
	<note type="report_type">arXiv preprint</note>
</biblStruct>

<biblStruct xml:id="b1">
	<monogr>
		<author>
			<persName><forename type="first">C</forename><surname>Raffel</surname></persName>
		</author>
		<author>
			<persName><forename type="first">N</forename><surname>Shazeer</surname></persName>
		</author>
		<author>
			<persName><forename type="first">A</forename><surname>Roberts</surname></persName>
		</author>
		<author>
			<persName><forename type="first">K</forename><surname>Lee</surname></persName>
		</author>
		<author>
			<persName><forename type="first">S</forename><surname>Narang</surname></persName>
		</author>
		<author>
			<persName><forename type="first">M</forename><surname>Matena</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Y</forename><surname>Zhou</surname></persName>
		</author>
		<author>
			<persName><forename type="first">W</forename><surname>Li</surname></persName>
		</author>
		<author>
			<persName><forename type="first">P</forename><forename type="middle">J</forename><surname>Liu</surname></persName>
		</author>
		<idno type="arXiv">arXiv:1910.10683</idno>
		<title level="m">Exploring the limits of transfer learning with a unified text-to-text transformer</title>
				<imprint>
			<date type="published" when="2019">2019</date>
		</imprint>
	</monogr>
	<note type="report_type">arXiv preprint</note>
</biblStruct>

<biblStruct xml:id="b2">
	<analytic>
		<title level="a" type="main">Language models are few-shot learners</title>
		<author>
			<persName><forename type="first">T</forename><forename type="middle">B</forename><surname>Brown</surname></persName>
		</author>
		<author>
			<persName><forename type="first">B</forename><surname>Mann</surname></persName>
		</author>
		<author>
			<persName><forename type="first">N</forename><surname>Ryder</surname></persName>
		</author>
		<author>
			<persName><forename type="first">M</forename><surname>Subbiah</surname></persName>
		</author>
		<author>
			<persName><forename type="first">J</forename><surname>Kaplan</surname></persName>
		</author>
		<author>
			<persName><forename type="first">P</forename><surname>Dhariwal</surname></persName>
		</author>
		<author>
			<persName><forename type="first">A</forename><surname>Neelakantan</surname></persName>
		</author>
		<author>
			<persName><forename type="first">P</forename><surname>Shyam</surname></persName>
		</author>
		<author>
			<persName><forename type="first">G</forename><surname>Sastry</surname></persName>
		</author>
		<author>
			<persName><forename type="first">A</forename><surname>Askell</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="j">Advances in Neural Information Processing Systems</title>
		<imprint>
			<biblScope unit="volume">33</biblScope>
			<biblScope unit="page" from="1877" to="1901" />
			<date type="published" when="2020">2020</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b3">
	<monogr>
		<author>
			<persName><forename type="first">E</forename><surname>Kharitonov</surname></persName>
		</author>
		<author>
			<persName><forename type="first">A</forename><surname>Lee</surname></persName>
		</author>
		<author>
			<persName><forename type="first">A</forename><surname>Polyak</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Y</forename><surname>Adi</surname></persName>
		</author>
		<author>
			<persName><forename type="first">J</forename><surname>Copet</surname></persName>
		</author>
		<author>
			<persName><forename type="first">K</forename><surname>Lakhotia</surname></persName>
		</author>
		<author>
			<persName><forename type="first">T.-A</forename><surname>Nguyen</surname></persName>
		</author>
		<author>
			<persName><forename type="first">M</forename><surname>Rivière</surname></persName>
		</author>
		<author>
			<persName><forename type="first">A</forename><surname>Mohamed</surname></persName>
		</author>
		<author>
			<persName><forename type="first">E</forename><surname>Dupoux</surname></persName>
		</author>
		<author>
			<persName><forename type="first">W.-N</forename><surname>Hsu</surname></persName>
		</author>
		<idno type="arXiv">arXiv:2109.03264</idno>
		<title level="m">Text-Free Prosody-Aware Generative Spoken Language Modeling</title>
				<imprint>
			<date type="published" when="2021">2021</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b4">
	<monogr>
		<author>
			<persName><forename type="first">J</forename><forename type="middle">W</forename><surname>Rae</surname></persName>
		</author>
		<author>
			<persName><forename type="first">S</forename><surname>Borgeaud</surname></persName>
		</author>
		<author>
			<persName><forename type="first">T</forename><surname>Cai</surname></persName>
		</author>
		<author>
			<persName><forename type="first">K</forename><surname>Millican</surname></persName>
		</author>
		<author>
			<persName><forename type="first">J</forename><surname>Hoffmann</surname></persName>
		</author>
		<author>
			<persName><forename type="first">F</forename><surname>Song</surname></persName>
		</author>
		<author>
			<persName><forename type="first">J</forename><surname>Aslanides</surname></persName>
		</author>
		<author>
			<persName><forename type="first">S</forename><surname>Henderson</surname></persName>
		</author>
		<author>
			<persName><forename type="first">R</forename><surname>Ring</surname></persName>
		</author>
		<author>
			<persName><forename type="first">S</forename><surname>Young</surname></persName>
		</author>
		<idno type="arXiv">arXiv:2112.11446</idno>
		<title level="m">Scaling language models: Methods, analysis &amp; insights from training Gopher</title>
				<imprint>
			<date type="published" when="2021">2021</date>
		</imprint>
	</monogr>
	<note type="report_type">arXiv preprint</note>
</biblStruct>

<biblStruct xml:id="b5">
	<monogr>
		<author>
			<persName><forename type="first">D</forename><surname>Hendrycks</surname></persName>
		</author>
		<author>
			<persName><forename type="first">C</forename><surname>Burns</surname></persName>
		</author>
		<author>
			<persName><forename type="first">S</forename><surname>Basart</surname></persName>
		</author>
		<author>
			<persName><forename type="first">A</forename><surname>Zou</surname></persName>
		</author>
		<author>
			<persName><forename type="first">M</forename><surname>Mazeika</surname></persName>
		</author>
		<author>
			<persName><forename type="first">D</forename><surname>Song</surname></persName>
		</author>
		<author>
			<persName><forename type="first">J</forename><surname>Steinhardt</surname></persName>
		</author>
		<idno type="arXiv">arXiv:2009.03300</idno>
		<title level="m">Measuring massive multitask language understanding</title>
				<imprint>
			<date type="published" when="2020">2020</date>
		</imprint>
	</monogr>
	<note type="report_type">arXiv preprint</note>
</biblStruct>

<biblStruct xml:id="b6">
	<monogr>
		<author>
			<persName><forename type="first">D</forename><surname>Hendrycks</surname></persName>
		</author>
		<author>
			<persName><forename type="first">S</forename><surname>Basart</surname></persName>
		</author>
		<author>
			<persName><forename type="first">S</forename><surname>Kadavath</surname></persName>
		</author>
		<author>
			<persName><forename type="first">M</forename><surname>Mazeika</surname></persName>
		</author>
		<author>
			<persName><forename type="first">A</forename><surname>Arora</surname></persName>
		</author>
		<author>
			<persName><forename type="first">E</forename><surname>Guo</surname></persName>
		</author>
		<author>
			<persName><forename type="first">C</forename><surname>Burns</surname></persName>
		</author>
		<author>
			<persName><forename type="first">S</forename><surname>Puranik</surname></persName>
		</author>
		<author>
			<persName><forename type="first">H</forename><surname>He</surname></persName>
		</author>
		<author>
			<persName><forename type="first">D</forename><surname>Song</surname></persName>
		</author>
		<author>
			<persName><forename type="first">J</forename><surname>Steinhardt</surname></persName>
		</author>
		<idno type="arXiv">arXiv:2105.09938</idno>
		<title level="m">Measuring coding challenge competence with APPS</title>
				<imprint>
			<date type="published" when="2021">2021</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b7">
	<monogr>
		<author>
			<persName><forename type="first">R</forename><surname>Bommasani</surname></persName>
		</author>
		<idno type="arXiv">arXiv:2108.07258</idno>
		<title level="m">On the opportunities and risks of foundation models</title>
				<imprint>
			<date type="published" when="2021">2021</date>
		</imprint>
	</monogr>
	<note type="report_type">arXiv preprint</note>
</biblStruct>

<biblStruct xml:id="b8">
	<monogr>
		<author>
			<persName><forename type="first">L</forename><surname>Ouyang</surname></persName>
		</author>
		<author>
			<persName><forename type="first">J</forename><surname>Wu</surname></persName>
		</author>
		<author>
			<persName><forename type="first">X</forename><surname>Jiang</surname></persName>
		</author>
		<author>
			<persName><forename type="first">D</forename><surname>Almeida</surname></persName>
		</author>
		<author>
			<persName><forename type="first">C</forename><forename type="middle">L</forename><surname>Wainwright</surname></persName>
		</author>
		<author>
			<persName><forename type="first">P</forename><surname>Mishkin</surname></persName>
		</author>
		<author>
			<persName><forename type="first">C</forename><surname>Zhang</surname></persName>
		</author>
		<author>
			<persName><forename type="first">S</forename><surname>Agarwal</surname></persName>
		</author>
		<author>
			<persName><forename type="first">K</forename><surname>Slama</surname></persName>
		</author>
		<author>
			<persName><forename type="first">A</forename><surname>Ray</surname></persName>
		</author>
		<idno type="arXiv">arXiv:2203.02155</idno>
		<title level="m">Training language models to follow instructions with human feedback</title>
				<imprint>
			<date type="published" when="2022">2022</date>
		</imprint>
	</monogr>
	<note type="report_type">arXiv preprint</note>
</biblStruct>

<biblStruct xml:id="b9">
	<monogr>
		<title level="m" type="main">Beyond the imitation game: Quantifying and extrapolating the capabilities of language models</title>
		<author>
			<persName><forename type="first">A</forename><surname>Srivastava</surname></persName>
		</author>
		<author>
			<persName><forename type="first">A</forename><surname>Rastogi</surname></persName>
		</author>
		<idno type="DOI">10.48550/ARXIV.2206.04615</idno>
		<ptr target="https://arxiv.org/abs/2206.04615" />
		<imprint>
			<date type="published" when="2022">2022</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b10">
	<analytic>
		<title level="a" type="main">Classification with reject option</title>
		<author>
			<persName><forename type="first">R</forename><surname>Herbei</surname></persName>
		</author>
		<author>
			<persName><forename type="first">M</forename><forename type="middle">H</forename><surname>Wegkamp</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="j">The Canadian Journal of Statistics/La Revue Canadienne de Statistique</title>
		<imprint>
			<biblScope unit="page" from="709" to="721" />
			<date type="published" when="2006">2006</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b11">
	<analytic>
		<title level="a" type="main">An optimal reject rule for binary classifiers</title>
		<author>
			<persName><forename type="first">F</forename><surname>Tortorella</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="m">Joint IAPR International Workshops on Statistical Techniques in Pattern Recognition (SPR) and Structural and Syntactic Pattern Recognition (SSPR)</title>
				<imprint>
			<publisher>Springer</publisher>
			<date type="published" when="2000">2000</date>
			<biblScope unit="page" from="611" to="620" />
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b12">
	<monogr>
		<author>
			<persName><forename type="first">K</forename><surname>Hendrickx</surname></persName>
		</author>
		<author>
			<persName><forename type="first">L</forename><surname>Perini</surname></persName>
		</author>
		<author>
			<persName><forename type="first">D</forename><surname>Van Der Plas</surname></persName>
		</author>
		<author>
			<persName><forename type="first">W</forename><surname>Meert</surname></persName>
		</author>
		<author>
			<persName><forename type="first">J</forename><surname>Davis</surname></persName>
		</author>
		<idno type="arXiv">arXiv:2107.11277</idno>
		<title level="m">Machine learning with a reject option: A survey</title>
				<imprint>
			<date type="published" when="2021">2021</date>
		</imprint>
	</monogr>
	<note type="report_type">arXiv preprint</note>
</biblStruct>

<biblStruct xml:id="b13">
	<analytic>
		<title level="a" type="main">How Can We Know When Language Models Know? On the Calibration of Language Models for Question Answering</title>
		<author>
			<persName><forename type="first">Z</forename><surname>Jiang</surname></persName>
		</author>
		<author>
			<persName><forename type="first">J</forename><surname>Araki</surname></persName>
		</author>
		<author>
			<persName><forename type="first">H</forename><surname>Ding</surname></persName>
		</author>
		<author>
			<persName><forename type="first">G</forename><surname>Neubig</surname></persName>
		</author>
		<idno type="DOI">10.1162/tacl_a_00407</idno>
	</analytic>
	<monogr>
		<title level="j">Transactions of the Association for Computational Linguistics</title>
		<imprint>
			<biblScope unit="volume">9</biblScope>
			<biblScope unit="page" from="962" to="977" />
			<date type="published" when="2021">2021</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b14">
	<analytic>
		<title level="a" type="main">Training on the test set: Mapping the system-problem space in AI</title>
		<author>
			<persName><forename type="first">J</forename><surname>Hernández-Orallo</surname></persName>
		</author>
		<author>
			<persName><forename type="first">W</forename><surname>Schellaert</surname></persName>
		</author>
		<author>
			<persName><forename type="first">F</forename><surname>Martínez-Plumed</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="m">Proceedings of the AAAI Conference on Artificial Intelligence</title>
				<meeting>the AAAI Conference on Artificial Intelligence</meeting>
		<imprint>
			<date type="published" when="2022">2022</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b15">
	<analytic>
		<title level="a" type="main">Research directions in data wrangling: Visualizations and transformations for usable and credible data</title>
		<author>
			<persName><forename type="first">S</forename><surname>Kandel</surname></persName>
		</author>
		<author>
			<persName><forename type="first">J</forename><surname>Heer</surname></persName>
		</author>
		<author>
			<persName><forename type="first">C</forename><surname>Plaisant</surname></persName>
		</author>
		<author>
			<persName><forename type="first">J</forename><surname>Kennedy</surname></persName>
		</author>
		<author>
			<persName><forename type="first">F</forename><surname>Van Ham</surname></persName>
		</author>
		<author>
			<persName><forename type="first">N</forename><forename type="middle">H</forename><surname>Riche</surname></persName>
		</author>
		<author>
			<persName><forename type="first">C</forename><surname>Weaver</surname></persName>
		</author>
		<author>
			<persName><forename type="first">B</forename><surname>Lee</surname></persName>
		</author>
		<author>
			<persName><forename type="first">D</forename><surname>Brodbeck</surname></persName>
		</author>
		<author>
			<persName><forename type="first">P</forename><surname>Buono</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="j">Information Visualization</title>
		<imprint>
			<biblScope unit="volume">10</biblScope>
			<biblScope unit="page" from="271" to="288" />
			<date type="published" when="2011">2011</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b16">
	<analytic>
		<title level="a" type="main">Data wrangling for big data: Challenges and opportunities</title>
		<author>
			<persName><forename type="first">T</forename><surname>Furche</surname></persName>
		</author>
		<author>
			<persName><forename type="first">G</forename><surname>Gottlob</surname></persName>
		</author>
		<author>
			<persName><forename type="first">L</forename><surname>Libkin</surname></persName>
		</author>
		<author>
			<persName><forename type="first">G</forename><surname>Orsi</surname></persName>
		</author>
		<author>
			<persName><forename type="first">N</forename><surname>Paton</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="m">Proceedings of the 19th International Conference on Extending Database Technology</title>
				<meeting>the 19th International Conference on Extending Database Technology</meeting>
		<imprint>
			<date type="published" when="2016">2016. 2016</date>
			<biblScope unit="page" from="473" to="478" />
		</imprint>
	</monogr>
	<note>Advances in Database Technology-EDBT</note>
</biblStruct>

<biblStruct xml:id="b17">
	<monogr>
		<title level="m" type="main">Comparison between machine learning and human learning from examples generated with machine teaching</title>
		<author>
			<persName><forename type="first">G</forename><surname>Jaimovitch-López</surname></persName>
		</author>
		<imprint>
			<date type="published" when="2020">2020</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b18">
	<monogr>
		<author>
			<persName><forename type="first">W</forename><surname>Zeng</surname></persName>
		</author>
		<author>
			<persName><forename type="first">X</forename><surname>Ren</surname></persName>
		</author>
		<author>
			<persName><forename type="first">T</forename><surname>Su</surname></persName>
		</author>
		<author>
			<persName><forename type="first">H</forename><surname>Wang</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Y</forename><surname>Liao</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Z</forename><surname>Wang</surname></persName>
		</author>
		<author>
			<persName><forename type="first">X</forename><surname>Jiang</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Z</forename><surname>Yang</surname></persName>
		</author>
		<author>
			<persName><forename type="first">K</forename><surname>Wang</surname></persName>
		</author>
		<author>
			<persName><forename type="first">X</forename><surname>Zhang</surname></persName>
		</author>
		<idno type="arXiv">arXiv:2104.12369</idno>
		<title level="m">Pangu-𝛼: Large-scale autoregressive pretrained chinese language models with auto-parallel computation</title>
				<imprint>
			<date type="published" when="2021">2021</date>
		</imprint>
	</monogr>
	<note type="report_type">arXiv preprint</note>
</biblStruct>

<biblStruct xml:id="b19">
	<monogr>
		<author>
			<persName><forename type="first">N</forename><surname>Du</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Y</forename><surname>Huang</surname></persName>
		</author>
		<author>
			<persName><forename type="first">A</forename><forename type="middle">M</forename><surname>Dai</surname></persName>
		</author>
		<author>
			<persName><forename type="first">S</forename><surname>Tong</surname></persName>
		</author>
		<author>
			<persName><forename type="first">D</forename><surname>Lepikhin</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Y</forename><surname>Xu</surname></persName>
		</author>
		<author>
			<persName><forename type="first">M</forename><surname>Krikun</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Y</forename><surname>Zhou</surname></persName>
		</author>
		<author>
			<persName><forename type="first">A</forename><forename type="middle">W</forename><surname>Yu</surname></persName>
		</author>
		<author>
			<persName><forename type="first">O</forename><surname>Firat</surname></persName>
		</author>
		<author>
			<persName><forename type="first">B</forename><surname>Zoph</surname></persName>
		</author>
		<author>
			<persName><forename type="first">L</forename><surname>Fedus</surname></persName>
		</author>
		<author>
			<persName><forename type="first">M</forename><surname>Bosma</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Z</forename><surname>Zhou</surname></persName>
		</author>
		<author>
			<persName><forename type="first">T</forename><surname>Wang</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Y</forename><forename type="middle">E</forename><surname>Wang</surname></persName>
		</author>
		<author>
			<persName><forename type="first">K</forename><surname>Webster</surname></persName>
		</author>
		<author>
			<persName><forename type="first">M</forename><surname>Pellat</surname></persName>
		</author>
		<author>
			<persName><forename type="first">K</forename><surname>Robinson</surname></persName>
		</author>
		<author>
			<persName><forename type="first">K</forename><surname>Meier-Hellstern</surname></persName>
		</author>
		<author>
			<persName><forename type="first">T</forename><surname>Duke</surname></persName>
		</author>
		<author>
			<persName><forename type="first">L</forename><surname>Dixon</surname></persName>
		</author>
		<author>
			<persName><forename type="first">K</forename><surname>Zhang</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Q</forename><forename type="middle">V</forename><surname>Le</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Y</forename><surname>Wu</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Z</forename><surname>Chen</surname></persName>
		</author>
		<author>
			<persName><forename type="first">C</forename><surname>Cui</surname></persName>
		</author>
		<idno type="arXiv">arXiv:2112.06905</idno>
		<title level="m">GLaM: Efficient Scaling of Language Models with Mixture-of-Experts</title>
				<imprint>
			<date type="published" when="2021">2021</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b20">
	<monogr>
		<author>
			<persName><forename type="first">S</forename><surname>Zhang</surname></persName>
		</author>
		<author>
			<persName><forename type="first">S</forename><surname>Roller</surname></persName>
		</author>
		<author>
			<persName><forename type="first">N</forename><surname>Goyal</surname></persName>
		</author>
		<author>
			<persName><forename type="first">M</forename><surname>Artetxe</surname></persName>
		</author>
		<author>
			<persName><forename type="first">M</forename><surname>Chen</surname></persName>
		</author>
		<author>
			<persName><forename type="first">S</forename><surname>Chen</surname></persName>
		</author>
		<author>
			<persName><forename type="first">C</forename><surname>Dewan</surname></persName>
		</author>
		<author>
			<persName><forename type="first">M</forename><surname>Diab</surname></persName>
		</author>
		<author>
			<persName><forename type="first">X</forename><surname>Li</surname></persName>
		</author>
		<author>
			<persName><forename type="first">X</forename><forename type="middle">V</forename><surname>Lin</surname></persName>
		</author>
		<author>
			<persName><forename type="first">T</forename><surname>Mihaylov</surname></persName>
		</author>
		<author>
			<persName><forename type="first">M</forename><surname>Ott</surname></persName>
		</author>
		<author>
			<persName><forename type="first">S</forename><surname>Shleifer</surname></persName>
		</author>
		<author>
			<persName><forename type="first">K</forename><surname>Shuster</surname></persName>
		</author>
		<author>
			<persName><forename type="first">D</forename><surname>Simig</surname></persName>
		</author>
		<author>
			<persName><forename type="first">P</forename><forename type="middle">S</forename><surname>Koura</surname></persName>
		</author>
		<author>
			<persName><forename type="first">A</forename><surname>Sridhar</surname></persName>
		</author>
		<author>
			<persName><forename type="first">T</forename><surname>Wang</surname></persName>
		</author>
		<author>
			<persName><forename type="first">L</forename><surname>Zettlemoyer</surname></persName>
		</author>
		<idno type="arXiv">arXiv:2205.01068</idno>
		<title level="m">OPT: Open Pre-trained Transformer Language Models</title>
				<imprint>
			<date type="published" when="2022">2022</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b21">
	<monogr>
		<title level="m" type="main">Scaling laws for neural language models</title>
		<author>
			<persName><forename type="first">J</forename><surname>Kaplan</surname></persName>
		</author>
		<author>
			<persName><forename type="first">S</forename><surname>Mccandlish</surname></persName>
		</author>
		<author>
			<persName><forename type="first">T</forename><surname>Henighan</surname></persName>
		</author>
		<author>
			<persName><forename type="first">T</forename><forename type="middle">B</forename><surname>Brown</surname></persName>
		</author>
		<author>
			<persName><forename type="first">B</forename><surname>Chess</surname></persName>
		</author>
		<author>
			<persName><forename type="first">R</forename><surname>Child</surname></persName>
		</author>
		<author>
			<persName><forename type="first">S</forename><surname>Gray</surname></persName>
		</author>
		<author>
			<persName><forename type="first">A</forename><surname>Radford</surname></persName>
		</author>
		<author>
			<persName><forename type="first">J</forename><surname>Wu</surname></persName>
		</author>
		<author>
			<persName><forename type="first">D</forename><surname>Amodei</surname></persName>
		</author>
		<idno type="arXiv">arXiv:2001.08361</idno>
		<ptr target="https://arxiv.org/abs/2001.08361" />
		<imprint>
			<date type="published" when="2020">2020</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b22">
	<monogr>
		<author>
			<persName><forename type="first">R</forename><surname>Desislavov</surname></persName>
		</author>
		<author>
			<persName><forename type="first">F</forename><surname>Martínez-Plumed</surname></persName>
		</author>
		<author>
			<persName><forename type="first">J</forename><surname>Hernández-Orallo</surname></persName>
		</author>
		<idno type="arXiv">arXiv:2109.05472</idno>
		<title level="m">Compute and energy consumption trends in deep learning inference</title>
				<imprint>
			<date type="published" when="2021">2021</date>
		</imprint>
	</monogr>
	<note type="report_type">arXiv preprint</note>
</biblStruct>

<biblStruct xml:id="b23">
	<monogr>
		<author>
			<persName><forename type="first">O</forename><surname>Sharir</surname></persName>
		</author>
		<author>
			<persName><forename type="first">B</forename><surname>Peleg</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Y</forename><surname>Shoham</surname></persName>
		</author>
		<idno type="arXiv">arXiv:2004.08900</idno>
		<title level="m">The cost of training NLP models: A concise overview</title>
				<imprint>
			<date type="published" when="2020">2020</date>
		</imprint>
	</monogr>
	<note type="report_type">arXiv preprint</note>
</biblStruct>

<biblStruct xml:id="b24">
	<monogr>
		<author>
			<persName><forename type="first">B</forename><surname>Dickson</surname></persName>
		</author>
		<ptr target="https://bdtechtalks.com/2020/09/21/gpt-3-economy-business-model/" />
		<title level="m">The GPT-3 economy</title>
				<imprint>
			<date type="published" when="2020">2020</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b25">
	<monogr>
		<author>
			<persName><forename type="first">D</forename><surname>Steinberg</surname></persName>
		</author>
		<ptr target="http://info.salford-systems.com/blog/bid/299181/How-Much-Time-Needs-to-be-Spent-Preparing-Data-for-Analysis" />
		<title level="m">How much time needs to be spent preparing data for analysis?</title>
				<imprint>
			<date type="published" when="2013">2013</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b26">
	<analytic>
		<title level="a" type="main">Towards automatic data format transformations: Data wrangling at scale</title>
		<author>
			<persName><forename type="first">A</forename><surname>Bogatu</surname></persName>
		</author>
		<author>
			<persName><forename type="first">N</forename><forename type="middle">W</forename><surname>Paton</surname></persName>
		</author>
		<author>
			<persName><forename type="first">A</forename><forename type="middle">A</forename><surname>Fernandes</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="m">British International Conference on Databases</title>
				<imprint>
			<publisher>Springer</publisher>
			<date type="published" when="2017">2017</date>
			<biblScope unit="page" from="36" to="48" />
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b27">
	<analytic>
		<title level="a" type="main">Inductive programming meets the real world</title>
		<author>
			<persName><forename type="first">S</forename><surname>Gulwani</surname></persName>
		</author>
		<author>
			<persName><forename type="first">J</forename><surname>Hernández-Orallo</surname></persName>
		</author>
		<author>
			<persName><forename type="first">E</forename><surname>Kitzelmann</surname></persName>
		</author>
		<author>
			<persName><forename type="first">S</forename><forename type="middle">H</forename><surname>Muggleton</surname></persName>
		</author>
		<author>
			<persName><forename type="first">U</forename><surname>Schmid</surname></persName>
		</author>
		<author>
			<persName><forename type="first">B</forename><surname>Zorn</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="j">Communications of the ACM</title>
		<imprint>
			<biblScope unit="volume">58</biblScope>
			<biblScope unit="page" from="90" to="99" />
			<date type="published" when="2015">2015</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b28">
	<analytic>
		<title level="a" type="main">Automated data transformation with inductive programming and dynamic background knowledge</title>
		<author>
			<persName><forename type="first">L</forename><surname>Contreras-Ochando</surname></persName>
		</author>
		<author>
			<persName><forename type="first">C</forename><surname>Ferri</surname></persName>
		</author>
		<author>
			<persName><forename type="first">J</forename><surname>Hernández-Orallo</surname></persName>
		</author>
		<author>
			<persName><forename type="first">F</forename><surname>Martínez-Plumed</surname></persName>
		</author>
		<author>
			<persName><forename type="first">M</forename><forename type="middle">J</forename><surname>Ramírez-Quintana</surname></persName>
		</author>
		<author>
			<persName><forename type="first">S</forename><surname>Katayama</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="m">Joint European Conference on Machine Learning and Knowledge Discovery in Databases</title>
				<imprint>
			<publisher>Springer</publisher>
			<date type="published" when="2019">2019</date>
			<biblScope unit="page" from="735" to="751" />
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b29">
	<analytic>
		<title level="a" type="main">Wrangler: Interactive visual specification of data transformation scripts</title>
		<author>
			<persName><forename type="first">S</forename><surname>Kandel</surname></persName>
		</author>
		<author>
			<persName><forename type="first">A</forename><surname>Paepcke</surname></persName>
		</author>
		<author>
			<persName><forename type="first">J</forename><surname>Hellerstein</surname></persName>
		</author>
		<author>
			<persName><forename type="first">J</forename><surname>Heer</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="m">Proceedings of the SIGCHI Conference on Human Factors in Computing Systems</title>
				<meeting>the SIGCHI Conference on Human Factors in Computing Systems</meeting>
		<imprint>
			<date type="published" when="2011">2011</date>
			<biblScope unit="page" from="3363" to="3372" />
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b30">
	<analytic>
		<title level="a" type="main">Automating data science</title>
		<author>
			<persName><forename type="first">T</forename><surname>De Bie</surname></persName>
		</author>
		<author>
			<persName><forename type="first">L</forename><surname>De Raedt</surname></persName>
		</author>
		<author>
			<persName><forename type="first">J</forename><surname>Hernández-Orallo</surname></persName>
		</author>
		<author>
			<persName><forename type="first">H</forename><forename type="middle">H</forename><surname>Hoos</surname></persName>
		</author>
		<author>
			<persName><forename type="first">P</forename><surname>Smyth</surname></persName>
		</author>
		<author>
			<persName><forename type="first">C</forename><forename type="middle">K I</forename><surname>Williams</surname></persName>
		</author>
		<idno type="DOI">10.1145/3495256</idno>
	</analytic>
	<monogr>
		<title level="j">Communications of the ACM</title>
		<imprint>
			<biblScope unit="volume">65</biblScope>
			<biblScope unit="page" from="76" to="87" />
			<date type="published" when="2022">2022</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b31">
	<monogr>
		<author>
			<persName><forename type="first">R</forename><surname>Puri</surname></persName>
		</author>
		<author>
			<persName><forename type="first">B</forename><surname>Catanzaro</surname></persName>
		</author>
		<idno type="arXiv">arXiv:1912.10165</idno>
		<title level="m">Zero-shot text classification with generative language models</title>
				<imprint>
			<date type="published" when="2019">2019</date>
		</imprint>
	</monogr>
	<note type="report_type">arXiv preprint</note>
</biblStruct>

<biblStruct xml:id="b32">
	<monogr>
		<title level="m" type="main">Exploiting cloze questions for few shot text classification and natural language inference</title>
		<author>
			<persName><forename type="first">T</forename><surname>Schick</surname></persName>
		</author>
		<author>
			<persName><forename type="first">H</forename><surname>Schütze</surname></persName>
		</author>
		<idno type="arXiv">arXiv:2001.07676</idno>
		<imprint>
			<date type="published" when="2020">2020</date>
		</imprint>
	</monogr>
	<note type="report_type">arXiv preprint</note>
</biblStruct>

<biblStruct xml:id="b33">
	<analytic>
		<title level="a" type="main">Can language models automate data wrangling?</title>
		<author>
			<persName><forename type="first">G</forename><surname>Jaimovitch-López</surname></persName>
		</author>
		<author>
			<persName><forename type="first">C</forename><surname>Ferri</surname></persName>
		</author>
		<author>
			<persName><forename type="first">J</forename><surname>Hernández-Orallo</surname></persName>
		</author>
		<author>
			<persName><forename type="first">F</forename><surname>Martínez-Plumed</surname></persName>
		</author>
		<author>
			<persName><forename type="first">M</forename><forename type="middle">J</forename><surname>Ramírez-Quintana</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="m">Workshop on Automating Data Science at ECML-PKDD</title>
				<imprint>
			<date type="published" when="2021">2021</date>
			<biblScope unit="page">13</biblScope>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b34">
	<monogr>
		<author>
			<persName><forename type="first">B</forename><surname>Zoph</surname></persName>
		</author>
		<author>
			<persName><forename type="first">I</forename><surname>Bello</surname></persName>
		</author>
		<author>
			<persName><forename type="first">S</forename><surname>Kumar</surname></persName>
		</author>
		<author>
			<persName><forename type="first">N</forename><surname>Du</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Y</forename><surname>Huang</surname></persName>
		</author>
		<author>
			<persName><forename type="first">J</forename><surname>Dean</surname></persName>
		</author>
		<author>
			<persName><forename type="first">N</forename><surname>Shazeer</surname></persName>
		</author>
		<author>
			<persName><forename type="first">W</forename><surname>Fedus</surname></persName>
		</author>
		<idno type="arXiv">arXiv:2202.08906</idno>
		<title level="m">Designing effective sparse expert models</title>
				<imprint>
			<date type="published" when="2022">2022</date>
		</imprint>
	</monogr>
	<note type="report_type">arXiv preprint</note>
</biblStruct>

<biblStruct xml:id="b35">
	<analytic>
		<title level="a" type="main">Random forests</title>
		<author>
			<persName><forename type="first">L</forename><surname>Breiman</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="j">Machine Learning</title>
		<imprint>
			<biblScope unit="volume">45</biblScope>
			<biblScope unit="page" from="5" to="32" />
			<date type="published" when="2001">2001</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b36">
	<analytic>
		<title level="a" type="main">A New Vector Partition of the Probability Score</title>
		<author>
			<persName><forename type="first">A</forename><surname>Murphy</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="j">Journal of Applied Meteorology</title>
		<imprint>
			<biblScope unit="volume">12</biblScope>
			<biblScope unit="page" from="595" to="600" />
			<date type="published" when="1973">1973</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b37">
	<analytic>
		<title level="a" type="main">On classification, ranking, and probability estimation</title>
		<author>
			<persName><forename type="first">P</forename><surname>Flach</surname></persName>
		</author>
		<author>
			<persName><forename type="first">E</forename><surname>Matsubara</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="m">Dagstuhl Seminar Proceedings</title>
				<imprint>
			<publisher>Schloss Dagstuhl-Leibniz-Zentrum für Informatik</publisher>
			<date type="published" when="2008">2008</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b38">
	<analytic>
		<title level="a" type="main">A unified view of performance metrics: Translating threshold choice into expected classification loss</title>
		<author>
			<persName><forename type="first">J</forename><surname>Hernández-Orallo</surname></persName>
		</author>
		<author>
			<persName><forename type="first">P</forename><surname>Flach</surname></persName>
		</author>
		<author>
			<persName><forename type="first">C</forename><surname>Ferri</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="j">Journal of Machine Learning Research</title>
		<imprint>
			<biblScope unit="volume">13</biblScope>
			<biblScope unit="page" from="2813" to="2869" />
			<date type="published" when="2012">2012</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b39">
	<analytic>
		<title level="a" type="main">Calibration of machine learning models</title>
		<author>
			<persName><forename type="first">A</forename><surname>Bella</surname></persName>
		</author>
		<author>
			<persName><forename type="first">C</forename><surname>Ferri</surname></persName>
		</author>
		<author>
			<persName><forename type="first">J</forename><surname>Hernández-Orallo</surname></persName>
		</author>
		<author>
			<persName><forename type="first">M</forename><forename type="middle">J</forename><surname>Ramírez-Quintana</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="m">Handbook of Research on Machine Learning Applications and Trends: Algorithms, Methods, and Techniques</title>
				<imprint>
			<publisher>IGI Global</publisher>
			<date type="published" when="2010">2010</date>
			<biblScope unit="page" from="128" to="146" />
		</imprint>
	</monogr>
</biblStruct>

				</listBibl>
			</div>
		</back>
	</text>
</TEI>
