<?xml version="1.0" encoding="UTF-8"?>
<TEI xml:space="preserve" xmlns="http://www.tei-c.org/ns/1.0" 
xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" 
xsi:schemaLocation="http://www.tei-c.org/ns/1.0 https://raw.githubusercontent.com/kermitt2/grobid/master/grobid-home/schemas/xsd/Grobid.xsd"
 xmlns:xlink="http://www.w3.org/1999/xlink">
	<teiHeader xml:lang="en">
		<fileDesc>
			<titleStmt>
				<title level="a" type="main">Reject Before You Run: Small Assessors Anticipate Big Language Models</title>
			</titleStmt>
			<publicationStmt>
				<publisher/>
				<availability status="unknown"><licence/></availability>
			</publicationStmt>
			<sourceDesc>
				<biblStruct>
					<analytic>
						<author>
							<persName><forename type="first">Lexin</forename><surname>Zhou</surname></persName>
							<affiliation key="aff0">
								<orgName type="department">VRAIN</orgName>
								<orgName type="laboratory">Valencian Research Institute for Artificial Intelligence</orgName>
								<orgName type="institution">Universitat Politècnica de València</orgName>
							</affiliation>
						</author>
						<author>
							<persName><forename type="first">Fernando</forename><surname>Martínez-Plumed</surname></persName>
							<affiliation key="aff0">
								<orgName type="department">VRAIN</orgName>
								<orgName type="laboratory">Valencian Research Institute for Artificial Intelligence</orgName>
								<orgName type="institution">Universitat Politècnica de València</orgName>
							</affiliation>
						</author>
						<author>
							<persName><forename type="first">José</forename><surname>Hernández-Orallo</surname></persName>
							<affiliation key="aff0">
								<orgName type="department">VRAIN</orgName>
								<orgName type="laboratory">Valencian Research Institute for Artificial Intelligence</orgName>
								<orgName type="institution">Universitat Politècnica de València</orgName>
							</affiliation>
							<affiliation key="aff1">
								<orgName type="department">Leverhulme Centre for the Future of Intelligence</orgName>
								<orgName type="institution">University of Cambridge</orgName>
							</affiliation>
						</author>
						<author>
							<persName><forename type="first">Cèsar</forename><surname>Ferri</surname></persName>
							<affiliation key="aff0">
								<orgName type="department">VRAIN</orgName>
								<orgName type="laboratory">Valencian Research Institute for Artificial Intelligence</orgName>
								<orgName type="institution">Universitat Politècnica de València</orgName>
							</affiliation>
						</author>
						<author>
							<persName><forename type="first">Wout</forename><surname>Schellaert</surname></persName>
							<affiliation key="aff0">
								<orgName type="department">VRAIN</orgName>
								<orgName type="laboratory">Valencian Research Institute for Artificial Intelligence</orgName>
								<orgName type="institution">Universitat Politècnica de València</orgName>
							</affiliation>
						</author>
						<title level="a" type="main">Reject Before You Run: Small Assessors Anticipate Big Language Models</title>
					</analytic>
					<monogr>
						<idno type="ISSN">1613-0073</idno>
					</monogr>
					<idno type="MD5">73AC8C695AF70B0FF0D22609B939817E</idno>
				</biblStruct>
			</sourceDesc>
		</fileDesc>
		<encodingDesc>
			<appInfo>
				<application version="0.7.2" ident="GROBID" when="2023-03-24T08:32+0000">
					<desc>GROBID - A machine learning software for extracting information from scholarly documents</desc>
					<ref target="https://github.com/kermitt2/grobid"/>
				</application>
			</appInfo>
		</encodingDesc>
		<profileDesc>
			<textClass>
				<keywords>
					<term>Assessor</term>
					<term>Anticipative Reject Option</term>
					<term>Language Model</term>
					<term>Data Wrangling</term>
					<term>AI Evaluation</term>
					<term>Instance Granularity</term>
				</keywords>
			</textClass>
			<abstract>
<div xmlns="http://www.tei-c.org/ns/1.0"><p>Large Language Models (LMs) are expensive to operate. It would be more frugal to avoid querying them when results are predictably bad. In this paper we therefore investigate whether it is possible to granularly predict the performance of these large LMs with a much smaller external model, the assessor, which is trained on evaluation results. For instance, given an input prompt, can an assessor estimate the probability of correct completion by a giant like GPT-3 Davinci (175B parameters)? Using a data-wrangling task included in the BIG-bench repository as a case study, we find it is indeed possible, and we report results that are comparable in accuracy and calibration to the LM itself. This suggests that, at least for some tasks, a lot of compute, money, and emissions could be spared through the assessor's anticipative reject option. It also suggests that assessors can capture meaningful extra information from the evaluation procedure, and as such, could be a useful complement to simple aggregate metrics.</p></div>
			</abstract>
		</profileDesc>
	</teiHeader>
	<text xml:lang="en">
		<body>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="1.">Introduction</head><p>Extensive experimental research on Language Models (LMs) keeps showing remarkable results across several domains, including mathematics, question answering, language understanding, and code generation <ref type="bibr" target="#b0">[1,</ref><ref type="bibr" target="#b1">2,</ref><ref type="bibr" target="#b2">3,</ref><ref type="bibr" target="#b3">4,</ref><ref type="bibr" target="#b4">5,</ref><ref type="bibr" target="#b5">6,</ref><ref type="bibr" target="#b6">7,</ref><ref type="bibr" target="#b7">8,</ref><ref type="bibr" target="#b8">9]</ref>. While the performance results for many tasks are quickly improving (on average), there is a high variance in the results depending on the particular task, the instances, and the prompts <ref type="bibr" target="#b9">[10]</ref>. For a given task, one can partially deal with the variability across instances through a traditional reject rule, where we abstain from using the model's decision when the probability of its answer (i.e., its "confidence") falls below a certain threshold <ref type="bibr" target="#b10">[11,</ref><ref type="bibr" target="#b11">12,</ref><ref type="bibr" target="#b12">13]</ref>. This requires a good calibration of the model. However, even if LMs were well calibrated, and they generally are not <ref type="bibr" target="#b9">[10,</ref><ref type="bibr" target="#b13">14]</ref>, it would still require actually running the inference. For large LMs this comes at a non-negligible cost per token, either in required infrastructure or through the price of the API. To avoid being wasteful, we explore how much we can anticipate the level of success for a particular instance (or a collection of instances) without running it through the LM at all.</p><note>EBeM '22: Workshop on AI Evaluation Beyond Metrics, July 25, 2022, Vienna, Austria. lzhou@inf.upv.es (L. Zhou); fermarpl@dsic.upv.es (F. Martínez-Plumed); jorallo@dsic.upv.es (J. Hernández-Orallo); cferri@dsic.upv.es (C. Ferri); wschell@vrain.upv.es (W. Schellaert). https://lexzhou.github.io/ (L. Zhou); https://nandomp.github.io/ (F. Martínez-Plumed); http://josephorallo.webs.upv.es/ (J. Hernández-Orallo); http://personales.upv.es/ceferra/ (C. Ferri); https://schellaert.org/ (W. Schellaert). ORCID: 0000-0003-1161-4270 (L. Zhou); 0000-0003-2902-6477 (F. Martínez-Plumed); 0000-0001-9746-7632 (J. Hernández-Orallo); 0000-0002-8975-1120 (C. Ferri); 0000-0002-9182-4747 (W. Schellaert).</note><p>For this, we need an assessor: an external conditional probability (or density) estimator that can reliably predict beforehand the performance of an LM at instance granularity <ref type="bibr" target="#b14">[15]</ref>. With a good assessor, we can calculate whether it is actually worth asking the LM for an answer, depending on factors such as the value of a correct result, the cost of running the model, and, of course, the performance estimated by the assessor. See Figure <ref type="figure" target="#fig_0">1</ref> for an illustrative example.</p><p>This paper describes a (successful) attempt at building such an assessor for a collection of large LMs consisting of various scales of GPT-3 <ref type="bibr" target="#b2">[3]</ref> and BIG-G <ref type="bibr" target="#b9">[10]</ref> models, applied to a diverse set of data-wrangling tasks. Data wrangling <ref type="bibr" target="#b15">[16,</ref><ref type="bibr" target="#b16">17]</ref> is a notoriously time-consuming data preparation chore where LMs have recently shown promising results <ref type="bibr" target="#b17">[18]</ref>. 
We discuss data-wrangling and the considerations regarding the use of LMs in sections 2.1 and 2.2.</p><p>To our knowledge, this is the first paper analysing assessors applied to the language domain, and to a plausible use case in general. Additionally,</p><p>• We find that lightweight assessors can give reliable instance-level predictions of the performance of large LMs.</p><p>• We find that their predictions are well-calibrated and unbiased, again comparable to the self-assessment of the LMs.</p><p>• We investigate the contributions of various features like #shots and #parameters to assessor performance.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head>Contributions</head></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="2.">Background</head><p>In this section we revisit some key ideas of LMs, their costs, their applications to the data wrangling problem, and the traditional (post-hoc) reject option. We also summarise the main elements of the recently introduced concept of assessor models.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="2.1.">(Large) Language Models</head><p>In less than a decade, research in Natural Language Processing (NLP) has been transformed by the appearance of a suite of LMs trained in an unsupervised manner on very large corpora. LMs capture more and more of the information in natural language, including the linguistic characteristics of various human languages and associated knowledge. Moreover, these models can be adapted (e.g., through fine-tuning) to a wide range of downstream tasks <ref type="bibr" target="#b7">[8]</ref>. Recent LMs such as GPT-3 <ref type="bibr" target="#b2">[3]</ref>, PanGu-𝛼 <ref type="bibr" target="#b18">[19]</ref>, GLaM <ref type="bibr" target="#b19">[20]</ref> and OPT <ref type="bibr" target="#b20">[21]</ref> have excelled at few-shot inference, where a task is solved by supplying a small set of correct examples formatted as a prompt. The quality of the completion usually depends on the number of supplied examples: for instance, 5-shot inference is usually better than 2-shot inference, but requires more effort from the user. However, on many occasions the cost of running LMs is not negligible, in both computational <ref type="bibr" target="#b21">[22,</ref><ref type="bibr" target="#b22">23]</ref> and economic terms <ref type="bibr" target="#b23">[24]</ref>. Large LMs, open source or not, all have steep development costs in common: a recent study <ref type="bibr" target="#b23">[24]</ref> puts the cost of developing an LM with only 1.5 billion parameters at $1.6 million. Inference cost is another drain: <ref type="bibr" target="#b24">[25]</ref> estimates the cost of running GPT-3 in the cloud at a minimum of $87,000 per year, with the current API price for Davinci being 6 cents per 750 words<ref type="foot" target="#foot_0">1</ref>. 
Of course, these costs go down as compute becomes cheaper, but larger models are expected to quickly replace the old ones to set the new state of the art. Also, as LMs improve, their penetration rates will increase, making them widespread in billions of semi-automated operations in many domains, and compute might easily become more of an issue, not less.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="2.2.">Data Wrangling</head><p>Data-wrangling <ref type="bibr" target="#b15">[16,</ref><ref type="bibr" target="#b16">17]</ref> is a data preparation task that data janitors, data scientists and other people operating with forms, spreadsheets and other data formatting situations consider a very monotonous and laborious part of their jobs. Data wrangling can require as much as 80 percent of their time <ref type="bibr" target="#b25">[26]</ref>, including tediously transforming data presented in heterogeneous formats into a standardised format for efficient access, understanding, and analysis. One of the challenges in data-wrangling automation consists of selecting the correct (string) transformations from the vast set of possible ones, and doing so having seen only a few examples <ref type="bibr" target="#b26">[27]</ref>. Many approaches have attempted to address this challenge by reducing the transformation space through the incorporation of prior knowledge <ref type="bibr" target="#b27">[28,</ref><ref type="bibr" target="#b28">29]</ref>. This has led to many tools that use domain-specific languages or need ad-hoc solutions <ref type="bibr" target="#b29">[30]</ref>.</p><p>Because LMs capture vast amounts of human knowledge across many different domains, they can be especially effective for more open-ended tasks such as data wrangling, a role recognised in data science automation <ref type="bibr" target="#b30">[31]</ref>. Using few-shot inference <ref type="bibr" target="#b2">[3,</ref><ref type="bibr" target="#b31">32,</ref><ref type="bibr" target="#b32">33]</ref>, LMs have shown promising yet unreliable results for data wrangling. For instance, in <ref type="bibr" target="#b17">[18]</ref> GPT-3 Davinci (prompted, not fine-tuned) achieves 56% accuracy in the 1-shot setting, 68% in the 4-shot setting, and almost 90% with 10 shots. 
Additionally, as opposed to LM results on other tasks, GPT-3 is also relatively well calibrated in the data-wrangling task, reporting a Brier score of 0.11 (see section 4).</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="2.3.">Reject option</head><p>Given these unreliable accuracies but good calibration scores, we could make a more reliable and effective use of these systems by not using those outputs for which the confidence of the system is low. In other words, if we know for which instances the LM is (likely to be) wrong, we can abstain from using the output of the LM in these cases. This is called a 'reject option', and a classic and straightforward implementation of it is to use a confidence threshold 𝑡 and compare it with the probability 𝑝( ŷ |𝑥) that a model 𝑝 assigns to its output ŷ . This represents the self-assigned probability of being correct (i.e., its confidence) <ref type="bibr" target="#b10">[11,</ref><ref type="bibr" target="#b11">12,</ref><ref type="bibr" target="#b12">13]</ref>. We set 𝑡 to match the error tolerance of the use case, and when 𝑝( ŷ |𝑥) &lt; 𝑡 , we do not use the output of the model, usually delegating to a human. However, this classical interpretation of the reject rule still requires running the model. As mentioned before, this can be expensive for large LMs. Whenever the reject rule triggers, it is not only that humans need to do the task manually, but we have also incurred a cost in the computation of a model that is effectively wasted.</p></div>
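The classical post-hoc reject rule above can be sketched in a few lines. This is an illustrative sketch, not code from the paper; the function names and the batch interface are assumptions.

```python
def should_reject(confidence: float, threshold: float) -> bool:
    """Classical reject rule: abstain when the model's self-assigned
    probability of being correct, p(y_hat|x), falls below the threshold t."""
    return confidence < threshold

def filter_outputs(outputs, confidences, threshold=0.5):
    """Keep only outputs whose confidence clears the threshold;
    rejected instances would be delegated to a human."""
    kept, delegated = [], []
    for out, conf in zip(outputs, confidences):
        (delegated if should_reject(conf, threshold) else kept).append(out)
    return kept, delegated
```

Note that both functions consume the model's own confidence, so the expensive inference has already happened by the time the rule fires, which is exactly the waste the anticipative variant avoids.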
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="2.4.">Assessors</head><p>Assessor models <ref type="bibr" target="#b14">[15]</ref> provide an external anticipative reject option instead. Assessors are conditional probability (or density) estimators R (𝑟|𝜋, 𝜇) that are trained on evaluation data. With 'evaluation data' we mean a set of evaluation records ⟨𝜋, 𝜇, 𝑟⟩, where 𝜋 refers to a profile or description of a particular system (e.g., deployment conditions, state, system architecture, or hyperparameters), 𝜇 refers to a particular instance (e.g., a prompt), and 𝑟 to an empirical measurement of the performance of 𝜋 on 𝜇.</p><p>Assessor models are meant to act as general mappings between the space of systems, the space of instances, and the corresponding distribution of scores. They are a way of capturing all available evaluation information in a single predictive model that could be used, e.g., to investigate what features make an instance difficult, to add confidence capabilities to systems that do not have them, or to select the optimal model for a specific instance. In this case, we focus on their use to provide an anticipative reject option: when R is built and shown to be an accurate estimator, we can use it to make inferences on the expected performance R (𝑟 = 1|𝜋, 𝜇) given a system 𝜋 and instance 𝜇 (or a collection of those).</p><p>We do still have to run actual inference on the assessor, but as we show in the experiments, they have the possibility of being multiple orders of magnitude smaller than the LMs, allowing us to cheaply avoid any LM inference that is doomed to fail.</p></div>
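The anticipative decision the assessor enables can be framed as a simple expected-utility comparison. The formula below is a sketch under our own assumptions (the paper does not fix a specific decision rule); the value and cost parameters are illustrative.

```python
def worth_querying(p_success: float, value_correct: float, cost_inference: float) -> bool:
    """Anticipative decision: query the LM only when the expected value of a
    correct answer, as estimated by the assessor R(r=1|pi, mu), exceeds the
    cost of running the LM on this instance."""
    return p_success * value_correct > cost_inference
```

For example, with a correct answer worth 10 units and an inference cost of 1 unit, any instance the assessor scores above a 10% success probability is worth sending to the LM.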
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="3.">Methods</head><p>In this section we identify the experimental setting, including goals of the analysis, the data sources and how they are converted into evaluation records, and how we build the assessor from them.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="3.1.">Experimental Questions</head><p>We set three experimental questions:</p><p>Q1: Can we build lightweight yet good assessors for language models in this domain?</p><p>Q2: Are the assessors of comparable quality to the language models when estimating probabilities?</p><p>Q3: What features from the systems and the instances are most relevant for predicting success and consequently for building good assessors?</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="3.2.">Data Sources and Train-Test Split</head><p>We work with the Data Wrangling Dataset Repository<ref type="foot" target="#foot_1">2</ref> , containing 119 tasks from 7 domains (dates, emails, free text, names, phones, times, and units). In particular, we use results (at instance level) from multiple LMs obtained from two different evaluation efforts. First, <ref type="bibr" target="#b33">[34]</ref> have produced granular results of the evaluation of different versions of GPT-3. We have 146k instances available for GPT-3 models Ada (350M), Babbage (1.3B), Curie (6.7B), and Davinci (175B), from 0-shot to 10-shot. More information about the architectures can be found in <ref type="bibr" target="#b2">[3]</ref>.</p><p>Second, <ref type="bibr" target="#b9">[10]</ref> provides results on the same benchmark for a collection of Google LMs of various parameter sizes.</p><p>Here we extract 86k instances, from 0-shot to 3-shot, for 22 models with parameter sizes ranging from 2M to 128B across two different model families, a decoder-only dense transformer (BIG-G dense) and a sparse Mixture-of-Experts <ref type="bibr" target="#b34">[35]</ref> model (BIG-G sparse). More information on the BIG-G network architectures is available in <ref type="bibr" target="#b9">[10]</ref>.</p><p>All models (GPT-3 and BIG-G variants) were queried with temperature set to 0, and none of them were fine-tuned for the data-wrangling task.</p><p>As the assessor is trained on a somewhat heterogeneous collection of systems and instances, we have to be careful to define a train-test partition of the evaluation results without contamination or information leakage. To this purpose, we must ensure that the same instances are consistently used across systems and shots. For example, we have to avoid that the result of BIG-G dense with 2-shots on instance 𝑖 is in the training set, while GPT-3 Ada's result with 0-shots on the same instance is in the test set. 
Figure <ref type="figure" target="#fig_1">2</ref> shows a visual representation of the partition requirements that ensure that this does not happen.</p></div>
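The partition requirement amounts to splitting by instance rather than by record, so that every result for a given instance (across systems and #shots) lands on the same side. A minimal sketch, where the record layout and field names are our own assumptions:

```python
import random

def split_by_instance(records, test_fraction=0.2, seed=0):
    """Train-test split without leakage: all evaluation records for the
    same instance (across systems and #shots) end up on the same side."""
    instance_ids = sorted({r["instance_id"] for r in records})
    rng = random.Random(seed)
    rng.shuffle(instance_ids)
    n_test = max(1, int(len(instance_ids) * test_fraction))
    test_ids = set(instance_ids[:n_test])
    train = [r for r in records if r["instance_id"] not in test_ids]
    test = [r for r in records if r["instance_id"] in test_ids]
    return train, test
```

Shuffling instance ids rather than records is the whole trick: a record-level split would scatter results for the same prompt across both sides.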
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="3.3.">Anatomy of the Evaluation Record</head><p>From our two data sources, we receive records of the shape ⟨system id, #shots, prompt, score⟩. We further annotate this record with features describing the system (𝜋), and extract meta features of the instance (𝜇) that are fit for tabular representation (as opposed to free form text). In the end, this creates a general record of the shape ⟨𝜋, 𝜇, 𝑟⟩ = ⟨⟨system features⟩, ⟨instance features⟩, score⟩. We describe these features in detail below, but ultimately the only constraint for making a useful assessor is that all system and instance features are available without actually running the original model.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="3.3.1.">System features</head><p>The available system features include a system id that refers to a specific trained LM, i.e., a set of learned parameters fitting a certain architecture, the id of that architecture (either GPT-3, BIG-G sparse, or BIG-G dense), whether a model is dense or sparse, and the number of parameters. These features will of course be the same for all records of the same trained model.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="3.3.2.">Instance features</head><p>Instance features include the number of shots, the id of the prompt-template <ref type="foot" target="#foot_2">3</ref> , and 54 simple binary metafeatures that can be automatically extracted through regular expressions from the original text. Examples include the kind of symbols the instance contains (e.g., numbers, dots, dashes) or whether it starts with a digit (see Figure 3).</p></div>
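As a concrete illustration, a handful of such binary metafeatures can be computed with regular expressions. These five are examples in the spirit of the 54 used here, not the actual feature set, and the names are our own:

```python
import re

# Illustrative binary metafeatures over an instance string.
METAFEATURES = {
    "has_digit": lambda s: bool(re.search(r"\d", s)),
    "has_dot": lambda s: "." in s,
    "has_dash": lambda s: "-" in s,
    "has_at_symbol": lambda s: "@" in s,
    "starts_with_digit": lambda s: bool(re.match(r"\d", s)),
}

def extract_metafeatures(text: str) -> dict:
    """Map an instance string to a row of 0/1 metafeatures fit for
    tabular representation."""
    return {name: int(check(text)) for name, check in METAFEATURES.items()}
```

A date-like string such as "25-03-2022" would light up the digit and dash features, while an email would light up the at-symbol and dot features, which is how the assessor can tell domains apart without seeing free-form text.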
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="3.3.3.">Score</head><p>For the data wrangling tasks, all scores are binary: 1 if the output of the LM matches the target string exactly, and 0 otherwise. The score is what the assessors must predict, and thus acts as a label during training.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="3.4.">Assessor Building and Evaluation</head><p>For the assessor model, we train a Random Forest <ref type="bibr" target="#b35">[36]</ref> with 100 decision trees, a minimum node size of 5, and 50% of the available variables randomly selected at each split, tuned through grid search on a validation set using an 84%-16% training-validation split<ref type="foot" target="#foot_3">4</ref> of the training set defined previously. For the remaining hyperparameters the defaults were used<ref type="foot" target="#foot_4">5</ref>. We report the Area Under the Receiver Operating Characteristic Curve (AUROC) and Brier Score (BS), as well as its decomposition into calibration and refinement loss <ref type="bibr" target="#b36">[37,</ref><ref type="bibr" target="#b37">38,</ref><ref type="bibr" target="#b38">39]</ref>. As a baseline to compare assessors to, we take the standard approach of interpreting the probability 𝑝( ŷ |𝑥) the LM assigns to its output ŷ as the "confidence" of the model, i.e., its self-assessed probability of being correct. However, there is no data 𝑝( ŷ |𝑥) recorded in the BIG-bench logs, so we cannot compare the assessor AUROC or BS to those of the BIG-G family of models. For GPT-3 this information is available.</p></div>
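The Brier score and its split into calibration and refinement loss can be approximated by binning the predicted probabilities. A minimal pure-Python sketch; the equal-width binning scheme is our assumption, as the exact decomposition procedure is not spelled out here:

```python
def brier_decomposition(probs, labels, n_bins=10):
    """Brier score with an approximate decomposition into calibration loss
    (how far mean confidence sits from mean accuracy per bin) and refinement
    loss (how mixed the labels are within each bin); BS is close to CAL + REF."""
    n = len(probs)
    bs = sum((p - y) ** 2 for p, y in zip(probs, labels)) / n
    buckets = {}
    for p, y in zip(probs, labels):
        b = min(int(p * n_bins), n_bins - 1)  # equal-width probability bins
        buckets.setdefault(b, []).append((p, y))
    cal = ref = 0.0
    for pairs in buckets.values():
        w = len(pairs) / n
        p_bar = sum(p for p, _ in pairs) / len(pairs)
        y_bar = sum(y for _, y in pairs) / len(pairs)
        cal += w * (p_bar - y_bar) ** 2
        ref += w * y_bar * (1 - y_bar)
    return bs, cal, ref
```

A perfectly calibrated but uninformative predictor (always 0.5 on balanced labels) gets all of its 0.25 Brier score from the refinement term, which is why a low CAL alongside a high REF, as in Tables 1 and 2, points at post-hoc calibration having limited headroom.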
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="4.">Results and Discussion</head><p>Since it is assumed that all models give better results with 𝑛 + 1 shots than with 𝑛 shots, Figure <ref type="figure" target="#fig_3">4</ref> shows the accuracies of the LMs with the maximum number of shots available. Table <ref type="table">1</ref> describes the AUROC and BS, decomposed into calibration loss (CAL) and refinement loss (REF), for the GPT-3 data given by an assessor trained with all features available except for the prompt-template id, along with GPT-3's self-assessment on the same set of instances. Table <ref type="table">2</ref> describes the AUROC and BS, decomposed into calibration loss (CAL) and refinement loss (REF), for the BIG-G data given by the assessor. Finally, Table <ref type="table" target="#tab_2">3</ref> shows the impact of various system and instance features on the performance of the assessor.</p><p>Table <ref type="table">1</ref>: AUROC and BS (Calibration, Refinement) on the GPT-3 data, for a single assessor trained with both GPT-3 and BIG-G data using all available features (except prompt-template id), alongside the self-estimation from the GPT-3 LMs. In the bottom row, AUROC and BS are not averaged, but calculated from the aggregated set of instances. The average accuracies of the GPT-3 LMs (with std. dev. across #shots) on the original data-wrangling task are also presented, and serve as an indication of the class distribution the assessor has to deal with.</p><p>Table <ref type="table">2</ref>: AUROC and BS (Calibration, Refinement) for the BIG-G data, using a single assessor trained with both GPT-3 and BIG-G data using all available features (except prompt-template id). In the bottom row, AUROC and BS are not averaged, but calculated from the aggregated set of instances. The average accuracies of the LMs (with std. dev. across #shots) on the original data-wrangling task are also presented, and serve as an indication of the class distribution the assessor has to deal with.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><p>Analysing the results in Table <ref type="table">1</ref> and Table <ref type="table">2</ref>, we see relatively good overall assessor performance, with AUROCs of around 0.9 and BSs around 0.12 (Q1). It should be noted that the metrics for the smallest LMs have to be interpreted cautiously due to the significant imbalance in the distribution of LM scores (i.e., for very low accuracies it is easy to predict that the LM will usually fail). We also see that, in general, performance is worse for the BIG-G models than for the GPT-3 ones. This could be due to GPT-3 being more predictable, or to the availability of more data for GPT-3 (possibly making the assessor pay more attention to the majority model family in its generalisation). This observation suggests that the distribution of results of each system affects the performance of the assessor accordingly. If we wanted to focus on building an assessor for a specific LM, techniques like instance weights or oversampling could help. Comparing these results with GPT-3's self-assessment in Table <ref type="table">1</ref>, we can conclude that the assessor performs slightly worse than GPT-3, but is definitely comparable (Q2). A significant part of the difference in BS comes from the calibration (CAL) term, and not from the refinement (REF) term, which is very similar for the LMs and the assessor, especially for the smaller versions of GPT-3. This suggests that post-hoc calibration methods <ref type="bibr" target="#b39">[40]</ref> like isotonic regression could still improve results significantly.</p><p>In the feature importance study in Table <ref type="table" target="#tab_2">3</ref>, we can see that using either the system id or the number of parameters improves performance significantly, likely because both can indicate the scale of the system, which highly correlates with performance. The use of system id generalises slightly worse than #parameters. Other features, like #shots, prompt-template id, or the model family and sparsity indicators, have less effect on performance (Q3). The assessor can easily derive the #shots from the input (more examples result in more features being present), so this makes sense. We did not measure any effect on aggregated performance from the different prompt-templates, and it is likely this feature is simply non-informative. Regarding model family and sparsity, we hypothesise that there is a large overlap between which instances the LMs solve correctly, so model architecture is not indicative of major performance differences (or the assessor fails to pick up on them).</p><p>Finally, we discuss a concrete example of using the assessor to implement a reject rule (see Table 4). For the GPT-3 data in the test set (24604 instances), we take a reject threshold of 1%, i.e., we reject instances where the assessor deems it less than 1% likely that the LM would succeed. The assessor rejects about 5340 instances, which account for 21.7% of the instances and (approximately) of the total compute. Of these 5340, 5114 are correctly rejected, representing 46% of the failures, at the cost of only 226 correct answers being rejected (about 1.5%).</p><p>Therefore, a lot of compute, money, and emissions would be saved, since the assessor is far smaller than the LMs in terms of parameters and inference time. Concretely, the proposed assessor has 100 decision trees of (approximately) 20000 nodes each, whose inference time is on the order of 100 ⋅ log 2 (20000) ≈ 1450 comparisons, far fewer than what the LMs require for a single pass through their billions of parameters.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head>Table 4</head><p>Confusion matrix with reject threshold &lt; 0.01 of assessor predictions for GPT-3. The 0 and 1 represent wrong and correct responses by the LM respectively. </p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="5.">Conclusions and Future work</head><p>We have illustrated how a small assessor can manage performance expectations at a level that is comparable to the self-assessment of giant language models with billions of parameters. We have shown the assessor can be well calibrated and make refined predictions. We find that the assessor picks up on system features like the system id or #parameters that explain large variances in performance. We showcase how they can be used to reject instances before running much larger language models, resulting in a significant saving of compute. There are of course some limitations to this work. For example, the instance metafeatures are specific to the data-wrangling tasks used. Nonetheless, the positive results hint at future work. There are still many ways of directly improving the assessor we have used here. For instance, we could use post-hoc calibration with methods such as isotonic regression, or add instance weights to the results of systems we especially care about. There are also many questions to further investigate. Do assessors work for other tasks? Can we use a small LM instead of a random forest to allow free-form input? What is the agreement between different systems, and with the assessor?</p><p>These future ideas could be useful from the perspective of saving computing costs as we outlined before, but the schema is of wider applicability. There is a lot of useful information generated during the evaluation process that is lost upon aggregation. Assessors are an attempt at capturing this information and providing expectation management that is external, fine-grained, anticipative, and can make use of population data. 
We could use them as instance-level model selectors, or we might be able to apply explainability techniques on the assessor to find out what makes an instance difficult.</p><p>There is definitely more to explore around the topic of assessors, which perform granular assessments beyond generic aggregated results: saving compute by rejecting examples where the original model is going to fail is an important illustrative application.</p></div><figure xmlns="http://www.tei-c.org/ns/1.0" xml:id="fig_0"><head>Figure 1 :</head><label>1</label><figDesc>Figure 1: (Top) Process of a LM generating the solution for a date transformation (repetitive) problem in a spreadsheet. Once the user prompts one instance of the desired transformation (row 1), the LM proceeds to transforming the rest of instances (rows 2 &amp; onward). (Bottom) Process of an assessor that can reliably predict beforehand the performance of the LM at the instance-level.</figDesc></figure>
<figure xmlns="http://www.tei-c.org/ns/1.0" xml:id="fig_1"><head>Figure 2 :</head><label>2</label><figDesc>Figure 2: Illustration of the matching requirements for making a train-test partition for the assessor. Each column represents a data wrangling prompt used to evaluate a LM. Orange columns represent instances included in the training set for the assessor, while green represents those included in the test set for the assessor. To avoid contamination, the same instances should be used across different shots and systems.</figDesc><graphic coords="4,89.52,84.19,202.92,93.02" type="bitmap" /></figure>
<figure xmlns="http://www.tei-c.org/ns/1.0" xml:id="fig_2"><head>Figure 3 :</head><label>3</label><figDesc>Figure 3: Example of metafeatures that can be extracted from the examples of different domains (dates and emails in the figure). Adapted from [29].</figDesc></figure>
<figure xmlns="http://www.tei-c.org/ns/1.0" xml:id="fig_3"><head>Figure 4 :</head><label>4</label><figDesc>Figure 4: LMs' accuracies per LM and size on the original data-wrangling task. Logarithmic scale used on the 𝑥-axis.</figDesc></figure>
<figure xmlns="http://www.tei-c.org/ns/1.0" type="table" xml:id="tab_2"><head>Table 3</head><label>3</label><figDesc>Ablation study of the impact of various features on assessor performance. The 54 instance metafeatures are always included. Row 4, in italics, indicates the assessor reported in Table 1 and Table 2.</figDesc><table><row><cell></cell><cell cols="4">system id, # parameters, template id, fam. &amp; spars., # shots</cell><cell>AUROC ( R )</cell><cell>BS (CAL, REF) ( R )</cell></row><row><cell>1</cell><cell>•</cell><cell></cell><cell></cell><cell></cell><cell>0.909</cell><cell>0.130 (0.015, 0.115)</cell></row><row><cell>2</cell><cell>•</cell><cell></cell><cell></cell><cell>•</cell><cell>0.910</cell><cell>0.130 (0.015, 0.115)</cell></row><row><cell>3</cell><cell>•</cell><cell></cell><cell>•</cell><cell>•</cell><cell>0.912</cell><cell>0.127 (0.014, 0.113)</cell></row><row><cell>4</cell><cell>•</cell><cell>•</cell><cell>•</cell><cell>•</cell><cell>0.916</cell><cell>0.128 (0.017, 0.111)</cell></row><row><cell>5</cell><cell></cell><cell>•</cell><cell>•</cell><cell>•</cell><cell>0.917</cell><cell>0.126 (0.015, 0.111)</cell></row><row><cell>6</cell><cell></cell><cell>•</cell><cell></cell><cell>•</cell><cell>0.916</cell><cell>0.126 (0.015, 0.111)</cell></row><row><cell>7</cell><cell></cell><cell>•</cell><cell>•</cell><cell>•</cell><cell>0.916</cell><cell>0.128 (0.016, 0.112)</cell></row><row><cell>8</cell><cell></cell><cell></cell><cell>•</cell><cell>•</cell><cell>0.869</cell><cell>0.167 (0.030, 0.137)</cell></row><row><cell>9</cell><cell></cell><cell></cell><cell>•</cell><cell>•</cell><cell>0.868</cell><cell>0.168 (0.031, 0.137)</cell></row><row><cell>10</cell><cell></cell><cell></cell><cell></cell><cell>•</cell><cell>0.865</cell><cell>0.170 (0.033, 0.137)</cell></row></table></figure>
			<note xmlns="http://www.tei-c.org/ns/1.0" place="foot" n="1" xml:id="foot_0">https://openai.com/api/pricing/</note>
			<note xmlns="http://www.tei-c.org/ns/1.0" place="foot" n="2" xml:id="foot_1">http://dmip.webs.upv.es/datawrangling/</note>
			<note xmlns="http://www.tei-c.org/ns/1.0" place="foot" n="3" xml:id="foot_2">The prompt-template differs between <ref type="bibr" target="#b9">[10]</ref> and <ref type="bibr" target="#b33">[34]</ref>, but the same metafeatures can be extracted (see Figure 3 for an example). We refer to <ref type="bibr" target="#b28">[29]</ref> for an overview. The binary metafeatures are available for every input and output in the prompt, so, for example, for a 2-shot prompt we would have 2 inputs and 2 outputs from the examples, and 1 input for the actual question, totalling 5 ⋅ 54 = 270 features.</note>
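The feature-count arithmetic in the footnote above generalises directly. As an illustrative sketch (the function name is ours, not the paper's): each prompt field, i.e. the k shot inputs, the k shot outputs, and the one query input, contributes 54 binary metafeatures, giving (2k + 1) · 54 features for a k-shot prompt.

```python
def n_assessor_features(n_shots: int, n_metafeatures: int = 54) -> int:
    """Total metafeatures for a k-shot prompt: (2k + 1) fields, 54 features each."""
    return (2 * n_shots + 1) * n_metafeatures

# The footnote's 2-shot case: 2 inputs + 2 outputs + 1 query input = 5 fields.
assert n_assessor_features(2) == 5 * 54 == 270
```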
			<note xmlns="http://www.tei-c.org/ns/1.0" place="foot" n="4" xml:id="foot_3">The non-standard train-test partition is the result of the instance matching procedure described in section 3.2.</note>
			<note xmlns="http://www.tei-c.org/ns/1.0" place="foot" n="5" xml:id="foot_4">The RandomForest package (https://cran.r-project.org/web/packages/randomForest/index.html) was used for training the assessor model.</note>
		</body>
		<back>

			<div type="acknowledgement">
<div xmlns="http://www.tei-c.org/ns/1.0"><head>Acknowledgments</head><p>We thank the anonymous reviewers for their comments. This work has been partially supported by the Norwegian Research Council grant 329745 Machine Teaching for Explainable AI, also by the EU (FEDER) and Spanish MINECO grant RTI2018-094403-B-C32 funded by MCIN/AEI/10.13039/501100011033 and by "ERDF A way of making Europe", Generalitat Valenciana under grant PROMETEO/2019/098, EU's Horizon 2020 research and innovation programme under grant agreement No. 952215 (TAILOR), US DARPA HR00112120007 (RECoG-AI), and INNEST/2021/317 (Project cofunded by the European Union with the "Programa Operativo del Fondo Europeo de Desarrollo Regional (FEDER) de la Comunitat Valenciana 2014-2020") and "the UPV (Vicerrectorado de Investigación) grant PAI-10-21".</p></div>
			</div>

			<div type="references">

				<listBibl>

<biblStruct xml:id="b0">
	<monogr>
		<author>
			<persName><forename type="first">J</forename><surname>Devlin</surname></persName>
		</author>
		<author>
			<persName><forename type="first">M.-W</forename><surname>Chang</surname></persName>
		</author>
		<author>
			<persName><forename type="first">K</forename><surname>Lee</surname></persName>
		</author>
		<author>
			<persName><forename type="first">K</forename><surname>Toutanova</surname></persName>
		</author>
		<idno type="arXiv">arXiv:1810.04805</idno>
		<title level="m">BERT: Pre-training of deep bidirectional transformers for language understanding</title>
				<imprint>
			<date type="published" when="2018">2018</date>
		</imprint>
	</monogr>
	<note type="report_type">arXiv preprint</note>
</biblStruct>

<biblStruct xml:id="b1">
	<monogr>
		<author>
			<persName><forename type="first">C</forename><surname>Raffel</surname></persName>
		</author>
		<author>
			<persName><forename type="first">N</forename><surname>Shazeer</surname></persName>
		</author>
		<author>
			<persName><forename type="first">A</forename><surname>Roberts</surname></persName>
		</author>
		<author>
			<persName><forename type="first">K</forename><surname>Lee</surname></persName>
		</author>
		<author>
			<persName><forename type="first">S</forename><surname>Narang</surname></persName>
		</author>
		<author>
			<persName><forename type="first">M</forename><surname>Matena</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Y</forename><surname>Zhou</surname></persName>
		</author>
		<author>
			<persName><forename type="first">W</forename><surname>Li</surname></persName>
		</author>
		<author>
			<persName><forename type="first">P</forename><forename type="middle">J</forename><surname>Liu</surname></persName>
		</author>
		<idno type="arXiv">arXiv:1910.10683</idno>
		<title level="m">Exploring the limits of transfer learning with a unified text-to-text transformer</title>
				<imprint>
			<date type="published" when="2019">2019</date>
		</imprint>
	</monogr>
	<note type="report_type">arXiv preprint</note>
</biblStruct>

<biblStruct xml:id="b2">
	<analytic>
		<title level="a" type="main">Language models are few-shot learners</title>
		<author>
			<persName><forename type="first">T</forename><forename type="middle">B</forename><surname>Brown</surname></persName>
		</author>
		<author>
			<persName><forename type="first">B</forename><surname>Mann</surname></persName>
		</author>
		<author>
			<persName><forename type="first">N</forename><surname>Ryder</surname></persName>
		</author>
		<author>
			<persName><forename type="first">M</forename><surname>Subbiah</surname></persName>
		</author>
		<author>
			<persName><forename type="first">J</forename><surname>Kaplan</surname></persName>
		</author>
		<author>
			<persName><forename type="first">P</forename><surname>Dhariwal</surname></persName>
		</author>
		<author>
			<persName><forename type="first">A</forename><surname>Neelakantan</surname></persName>
		</author>
		<author>
			<persName><forename type="first">P</forename><surname>Shyam</surname></persName>
		</author>
		<author>
			<persName><forename type="first">G</forename><surname>Sastry</surname></persName>
		</author>
		<author>
			<persName><forename type="first">A</forename><surname>Askell</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="j">Advances in Neural Information Processing Systems</title>
		<imprint>
			<biblScope unit="volume">33</biblScope>
			<biblScope unit="page" from="1877" to="1901" />
			<date type="published" when="2020">2020</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b3">
	<monogr>
		<author>
			<persName><forename type="first">E</forename><surname>Kharitonov</surname></persName>
		</author>
		<author>
			<persName><forename type="first">A</forename><surname>Lee</surname></persName>
		</author>
		<author>
			<persName><forename type="first">A</forename><surname>Polyak</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Y</forename><surname>Adi</surname></persName>
		</author>
		<author>
			<persName><forename type="first">J</forename><surname>Copet</surname></persName>
		</author>
		<author>
			<persName><forename type="first">K</forename><surname>Lakhotia</surname></persName>
		</author>
		<author>
			<persName><forename type="first">T.-A</forename><surname>Nguyen</surname></persName>
		</author>
		<author>
			<persName><forename type="first">M</forename><surname>Rivière</surname></persName>
		</author>
		<author>
			<persName><forename type="first">A</forename><surname>Mohamed</surname></persName>
		</author>
		<author>
			<persName><forename type="first">E</forename><surname>Dupoux</surname></persName>
		</author>
		<author>
			<persName><forename type="first">W.-N</forename><surname>Hsu</surname></persName>
		</author>
		<idno type="arXiv">arXiv:2109.03264</idno>
		<title level="m">Text-Free Prosody-Aware Generative Spoken Language Modeling</title>
				<imprint>
			<date type="published" when="2021">2021</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b4">
	<monogr>
		<author>
			<persName><forename type="first">J</forename><forename type="middle">W</forename><surname>Rae</surname></persName>
		</author>
		<author>
			<persName><forename type="first">S</forename><surname>Borgeaud</surname></persName>
		</author>
		<author>
			<persName><forename type="first">T</forename><surname>Cai</surname></persName>
		</author>
		<author>
			<persName><forename type="first">K</forename><surname>Millican</surname></persName>
		</author>
		<author>
			<persName><forename type="first">J</forename><surname>Hoffmann</surname></persName>
		</author>
		<author>
			<persName><forename type="first">F</forename><surname>Song</surname></persName>
		</author>
		<author>
			<persName><forename type="first">J</forename><surname>Aslanides</surname></persName>
		</author>
		<author>
			<persName><forename type="first">S</forename><surname>Henderson</surname></persName>
		</author>
		<author>
			<persName><forename type="first">R</forename><surname>Ring</surname></persName>
		</author>
		<author>
			<persName><forename type="first">S</forename><surname>Young</surname></persName>
		</author>
		<idno type="arXiv">arXiv:2112.11446</idno>
		<title level="m">Scaling language models: Methods, analysis &amp; insights from training Gopher</title>
				<imprint>
			<date type="published" when="2021">2021</date>
		</imprint>
	</monogr>
	<note type="report_type">arXiv preprint</note>
</biblStruct>

<biblStruct xml:id="b5">
	<monogr>
		<author>
			<persName><forename type="first">D</forename><surname>Hendrycks</surname></persName>
		</author>
		<author>
			<persName><forename type="first">C</forename><surname>Burns</surname></persName>
		</author>
		<author>
			<persName><forename type="first">S</forename><surname>Basart</surname></persName>
		</author>
		<author>
			<persName><forename type="first">A</forename><surname>Zou</surname></persName>
		</author>
		<author>
			<persName><forename type="first">M</forename><surname>Mazeika</surname></persName>
		</author>
		<author>
			<persName><forename type="first">D</forename><surname>Song</surname></persName>
		</author>
		<author>
			<persName><forename type="first">J</forename><surname>Steinhardt</surname></persName>
		</author>
		<idno type="arXiv">arXiv:2009.03300</idno>
		<title level="m">Measuring massive multitask language understanding</title>
				<imprint>
			<date type="published" when="2020">2020</date>
		</imprint>
	</monogr>
	<note type="report_type">arXiv preprint</note>
</biblStruct>

<biblStruct xml:id="b6">
	<monogr>
		<author>
			<persName><forename type="first">D</forename><surname>Hendrycks</surname></persName>
		</author>
		<author>
			<persName><forename type="first">S</forename><surname>Basart</surname></persName>
		</author>
		<author>
			<persName><forename type="first">S</forename><surname>Kadavath</surname></persName>
		</author>
		<author>
			<persName><forename type="first">M</forename><surname>Mazeika</surname></persName>
		</author>
		<author>
			<persName><forename type="first">A</forename><surname>Arora</surname></persName>
		</author>
		<author>
			<persName><forename type="first">E</forename><surname>Guo</surname></persName>
		</author>
		<author>
			<persName><forename type="first">C</forename><surname>Burns</surname></persName>
		</author>
		<author>
			<persName><forename type="first">S</forename><surname>Puranik</surname></persName>
		</author>
		<author>
			<persName><forename type="first">H</forename><surname>He</surname></persName>
		</author>
		<author>
			<persName><forename type="first">D</forename><surname>Song</surname></persName>
		</author>
		<author>
			<persName><forename type="first">J</forename><surname>Steinhardt</surname></persName>
		</author>
		<idno type="arXiv">arXiv:2105.09938</idno>
		<title level="m">Measuring coding challenge competence with APPS</title>
				<imprint>
			<date type="published" when="2021">2021</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b7">
	<monogr>
		<author>
			<persName><forename type="first">R</forename><surname>Bommasani</surname></persName>
		</author>
		<idno type="arXiv">arXiv:2108.07258</idno>
		<title level="m">On the opportunities and risks of foundation models</title>
				<imprint>
			<date type="published" when="2021">2021</date>
		</imprint>
	</monogr>
	<note type="report_type">arXiv preprint</note>
</biblStruct>

<biblStruct xml:id="b8">
	<monogr>
		<author>
			<persName><forename type="first">L</forename><surname>Ouyang</surname></persName>
		</author>
		<author>
			<persName><forename type="first">J</forename><surname>Wu</surname></persName>
		</author>
		<author>
			<persName><forename type="first">X</forename><surname>Jiang</surname></persName>
		</author>
		<author>
			<persName><forename type="first">D</forename><surname>Almeida</surname></persName>
		</author>
		<author>
			<persName><forename type="first">C</forename><forename type="middle">L</forename><surname>Wainwright</surname></persName>
		</author>
		<author>
			<persName><forename type="first">P</forename><surname>Mishkin</surname></persName>
		</author>
		<author>
			<persName><forename type="first">C</forename><surname>Zhang</surname></persName>
		</author>
		<author>
			<persName><forename type="first">S</forename><surname>Agarwal</surname></persName>
		</author>
		<author>
			<persName><forename type="first">K</forename><surname>Slama</surname></persName>
		</author>
		<author>
			<persName><forename type="first">A</forename><surname>Ray</surname></persName>
		</author>
		<idno type="arXiv">arXiv:2203.02155</idno>
		<title level="m">Training language models to follow instructions with human feedback</title>
				<imprint>
			<date type="published" when="2022">2022</date>
		</imprint>
	</monogr>
	<note type="report_type">arXiv preprint</note>
</biblStruct>

<biblStruct xml:id="b9">
	<monogr>
		<title level="m" type="main">Beyond the imitation game: Quantifying and extrapolating the capabilities of language models</title>
		<author>
			<persName><forename type="first">A</forename><surname>Srivastava</surname></persName>
		</author>
		<author>
			<persName><forename type="first">A</forename><surname>Rastogi</surname></persName>
		</author>
		<idno type="DOI">10.48550/ARXIV.2206.04615</idno>
		<ptr target="https://arxiv.org/abs/2206.04615" />
		<imprint>
			<date type="published" when="2022">2022</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b10">
	<analytic>
		<title level="a" type="main">Classification with reject option</title>
		<author>
			<persName><forename type="first">R</forename><surname>Herbei</surname></persName>
		</author>
		<author>
			<persName><forename type="first">M</forename><forename type="middle">H</forename><surname>Wegkamp</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="j">The Canadian Journal of Statistics/La Revue Canadienne de Statistique</title>
		<imprint>
			<biblScope unit="page" from="709" to="721" />
			<date type="published" when="2006">2006</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b11">
	<analytic>
		<title level="a" type="main">An optimal reject rule for binary classifiers</title>
		<author>
			<persName><forename type="first">F</forename><surname>Tortorella</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="m">Joint IAPR International Workshops on Statistical Techniques in Pattern Recognition (SPR) and Structural and Syntactic Pattern Recognition (SSPR)</title>
				<imprint>
			<publisher>Springer</publisher>
			<date type="published" when="2000">2000</date>
			<biblScope unit="page" from="611" to="620" />
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b12">
	<monogr>
		<author>
			<persName><forename type="first">K</forename><surname>Hendrickx</surname></persName>
		</author>
		<author>
			<persName><forename type="first">L</forename><surname>Perini</surname></persName>
		</author>
		<author>
			<persName><forename type="first">D</forename><surname>Van Der Plas</surname></persName>
		</author>
		<author>
			<persName><forename type="first">W</forename><surname>Meert</surname></persName>
		</author>
		<author>
			<persName><forename type="first">J</forename><surname>Davis</surname></persName>
		</author>
		<idno type="arXiv">arXiv:2107.11277</idno>
		<title level="m">Machine learning with a reject option: A survey</title>
				<imprint>
			<date type="published" when="2021">2021</date>
		</imprint>
	</monogr>
	<note type="report_type">arXiv preprint</note>
</biblStruct>

<biblStruct xml:id="b13">
	<analytic>
		<title level="a" type="main">How Can We Know When Language Models Know? On the Calibration of Language Models for Question Answering</title>
		<author>
			<persName><forename type="first">Z</forename><surname>Jiang</surname></persName>
		</author>
		<author>
			<persName><forename type="first">J</forename><surname>Araki</surname></persName>
		</author>
		<author>
			<persName><forename type="first">H</forename><surname>Ding</surname></persName>
		</author>
		<author>
			<persName><forename type="first">G</forename><surname>Neubig</surname></persName>
		</author>
		<idno type="DOI">10.1162/tacl_a_00407</idno>
	</analytic>
	<monogr>
		<title level="j">Transactions of the Association for Computational Linguistics</title>
		<imprint>
			<biblScope unit="volume">9</biblScope>
			<biblScope unit="page" from="962" to="977" />
			<date type="published" when="2021">2021</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b14">
	<analytic>
		<title level="a" type="main">Training on the test set: Mapping the system-problem space in AI</title>
		<author>
			<persName><forename type="first">J</forename><surname>Hernández-Orallo</surname></persName>
		</author>
		<author>
			<persName><forename type="first">W</forename><surname>Schellaert</surname></persName>
		</author>
		<author>
			<persName><forename type="first">F</forename><surname>Martínez-Plumed</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="m">Proceedings of the AAAI Conference on Artificial Intelligence</title>
				<meeting>the AAAI Conference on Artificial Intelligence</meeting>
		<imprint>
			<date type="published" when="2022">2022</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b15">
	<analytic>
		<title level="a" type="main">Research directions in data wrangling: Visualizations and transformations for usable and credible data</title>
		<author>
			<persName><forename type="first">S</forename><surname>Kandel</surname></persName>
		</author>
		<author>
			<persName><forename type="first">J</forename><surname>Heer</surname></persName>
		</author>
		<author>
			<persName><forename type="first">C</forename><surname>Plaisant</surname></persName>
		</author>
		<author>
			<persName><forename type="first">J</forename><surname>Kennedy</surname></persName>
		</author>
		<author>
			<persName><forename type="first">F</forename><surname>Van Ham</surname></persName>
		</author>
		<author>
			<persName><forename type="first">N</forename><forename type="middle">H</forename><surname>Riche</surname></persName>
		</author>
		<author>
			<persName><forename type="first">C</forename><surname>Weaver</surname></persName>
		</author>
		<author>
			<persName><forename type="first">B</forename><surname>Lee</surname></persName>
		</author>
		<author>
			<persName><forename type="first">D</forename><surname>Brodbeck</surname></persName>
		</author>
		<author>
			<persName><forename type="first">P</forename><surname>Buono</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="j">Information Visualization</title>
		<imprint>
			<biblScope unit="volume">10</biblScope>
			<biblScope unit="page" from="271" to="288" />
			<date type="published" when="2011">2011</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b16">
	<analytic>
		<title level="a" type="main">Data wrangling for big data: Challenges and opportunities</title>
		<author>
			<persName><forename type="first">T</forename><surname>Furche</surname></persName>
		</author>
		<author>
			<persName><forename type="first">G</forename><surname>Gottlob</surname></persName>
		</author>
		<author>
			<persName><forename type="first">L</forename><surname>Libkin</surname></persName>
		</author>
		<author>
			<persName><forename type="first">G</forename><surname>Orsi</surname></persName>
		</author>
		<author>
			<persName><forename type="first">N</forename><surname>Paton</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="m">Proceedings of the 19th International Conference on Extending Database Technology</title>
				<meeting>the 19th International Conference on Extending Database Technology</meeting>
		<imprint>
			<date type="published" when="2016">2016. 2016</date>
			<biblScope unit="page" from="473" to="478" />
		</imprint>
	</monogr>
	<note>Advances in Database Technology-EDBT</note>
</biblStruct>

<biblStruct xml:id="b17">
	<monogr>
		<title level="m" type="main">Comparison between machine learning and human learning from examples generated with machine teaching</title>
		<author>
			<persName><forename type="first">G</forename><surname>Jaimovitch-López</surname></persName>
		</author>
		<imprint>
			<date type="published" when="2020">2020</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b18">
	<monogr>
		<author>
			<persName><forename type="first">W</forename><surname>Zeng</surname></persName>
		</author>
		<author>
			<persName><forename type="first">X</forename><surname>Ren</surname></persName>
		</author>
		<author>
			<persName><forename type="first">T</forename><surname>Su</surname></persName>
		</author>
		<author>
			<persName><forename type="first">H</forename><surname>Wang</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Y</forename><surname>Liao</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Z</forename><surname>Wang</surname></persName>
		</author>
		<author>
			<persName><forename type="first">X</forename><surname>Jiang</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Z</forename><surname>Yang</surname></persName>
		</author>
		<author>
			<persName><forename type="first">K</forename><surname>Wang</surname></persName>
		</author>
		<author>
			<persName><forename type="first">X</forename><surname>Zhang</surname></persName>
		</author>
		<idno type="arXiv">arXiv:2104.12369</idno>
		<title level="m">Pangu-𝛼: Large-scale autoregressive pretrained chinese language models with auto-parallel computation</title>
				<imprint>
			<date type="published" when="2021">2021</date>
		</imprint>
	</monogr>
	<note type="report_type">arXiv preprint</note>
</biblStruct>

<biblStruct xml:id="b19">
	<monogr>
		<author>
			<persName><forename type="first">N</forename><surname>Du</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Y</forename><surname>Huang</surname></persName>
		</author>
		<author>
			<persName><forename type="first">A</forename><forename type="middle">M</forename><surname>Dai</surname></persName>
		</author>
		<author>
			<persName><forename type="first">S</forename><surname>Tong</surname></persName>
		</author>
		<author>
			<persName><forename type="first">D</forename><surname>Lepikhin</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Y</forename><surname>Xu</surname></persName>
		</author>
		<author>
			<persName><forename type="first">M</forename><surname>Krikun</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Y</forename><surname>Zhou</surname></persName>
		</author>
		<author>
			<persName><forename type="first">A</forename><forename type="middle">W</forename><surname>Yu</surname></persName>
		</author>
		<author>
			<persName><forename type="first">O</forename><surname>Firat</surname></persName>
		</author>
		<author>
			<persName><forename type="first">B</forename><surname>Zoph</surname></persName>
		</author>
		<author>
			<persName><forename type="first">L</forename><surname>Fedus</surname></persName>
		</author>
		<author>
			<persName><forename type="first">M</forename><surname>Bosma</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Z</forename><surname>Zhou</surname></persName>
		</author>
		<author>
			<persName><forename type="first">T</forename><surname>Wang</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Y</forename><forename type="middle">E</forename><surname>Wang</surname></persName>
		</author>
		<author>
			<persName><forename type="first">K</forename><surname>Webster</surname></persName>
		</author>
		<author>
			<persName><forename type="first">M</forename><surname>Pellat</surname></persName>
		</author>
		<author>
			<persName><forename type="first">K</forename><surname>Robinson</surname></persName>
		</author>
		<author>
			<persName><forename type="first">K</forename><surname>Meier-Hellstern</surname></persName>
		</author>
		<author>
			<persName><forename type="first">T</forename><surname>Duke</surname></persName>
		</author>
		<author>
			<persName><forename type="first">L</forename><surname>Dixon</surname></persName>
		</author>
		<author>
			<persName><forename type="first">K</forename><surname>Zhang</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Q</forename><forename type="middle">V</forename><surname>Le</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Y</forename><surname>Wu</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Z</forename><surname>Chen</surname></persName>
		</author>
		<author>
			<persName><forename type="first">C</forename><surname>Cui</surname></persName>
		</author>
		<idno type="arXiv">arXiv:2112.06905</idno>
		<title level="m">GLaM: Efficient Scaling of Language Models with Mixture-of-Experts</title>
				<imprint>
			<date type="published" when="2021">2021</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b20">
	<monogr>
		<author>
			<persName><forename type="first">S</forename><surname>Zhang</surname></persName>
		</author>
		<author>
			<persName><forename type="first">S</forename><surname>Roller</surname></persName>
		</author>
		<author>
			<persName><forename type="first">N</forename><surname>Goyal</surname></persName>
		</author>
		<author>
			<persName><forename type="first">M</forename><surname>Artetxe</surname></persName>
		</author>
		<author>
			<persName><forename type="first">M</forename><surname>Chen</surname></persName>
		</author>
		<author>
			<persName><forename type="first">S</forename><surname>Chen</surname></persName>
		</author>
		<author>
			<persName><forename type="first">C</forename><surname>Dewan</surname></persName>
		</author>
		<author>
			<persName><forename type="first">M</forename><surname>Diab</surname></persName>
		</author>
		<author>
			<persName><forename type="first">X</forename><surname>Li</surname></persName>
		</author>
		<author>
			<persName><forename type="first">X</forename><forename type="middle">V</forename><surname>Lin</surname></persName>
		</author>
		<author>
			<persName><forename type="first">T</forename><surname>Mihaylov</surname></persName>
		</author>
		<author>
			<persName><forename type="first">M</forename><surname>Ott</surname></persName>
		</author>
		<author>
			<persName><forename type="first">S</forename><surname>Shleifer</surname></persName>
		</author>
		<author>
			<persName><forename type="first">K</forename><surname>Shuster</surname></persName>
		</author>
		<author>
			<persName><forename type="first">D</forename><surname>Simig</surname></persName>
		</author>
		<author>
			<persName><forename type="first">P</forename><forename type="middle">S</forename><surname>Koura</surname></persName>
		</author>
		<author>
			<persName><forename type="first">A</forename><surname>Sridhar</surname></persName>
		</author>
		<author>
			<persName><forename type="first">T</forename><surname>Wang</surname></persName>
		</author>
		<author>
			<persName><forename type="first">L</forename><surname>Zettlemoyer</surname></persName>
		</author>
		<idno type="arXiv">arXiv:2205.01068</idno>
		<title level="m">OPT: Open Pre-trained Transformer Language Models</title>
				<imprint>
			<date type="published" when="2022">2022</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b21">
	<monogr>
		<title level="m" type="main">Scaling laws for neural language models</title>
		<author>
			<persName><forename type="first">J</forename><surname>Kaplan</surname></persName>
		</author>
		<author>
			<persName><forename type="first">S</forename><surname>Mccandlish</surname></persName>
		</author>
		<author>
			<persName><forename type="first">T</forename><surname>Henighan</surname></persName>
		</author>
		<author>
			<persName><forename type="first">T</forename><forename type="middle">B</forename><surname>Brown</surname></persName>
		</author>
		<author>
			<persName><forename type="first">B</forename><surname>Chess</surname></persName>
		</author>
		<author>
			<persName><forename type="first">R</forename><surname>Child</surname></persName>
		</author>
		<author>
			<persName><forename type="first">S</forename><surname>Gray</surname></persName>
		</author>
		<author>
			<persName><forename type="first">A</forename><surname>Radford</surname></persName>
		</author>
		<author>
			<persName><forename type="first">J</forename><surname>Wu</surname></persName>
		</author>
		<author>
			<persName><forename type="first">D</forename><surname>Amodei</surname></persName>
		</author>
		<idno type="arXiv">arXiv:2001.08361</idno>
		<ptr target="https://arxiv.org/abs/2001.08361" />
		<imprint>
			<date type="published" when="2020">2020</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b22">
	<monogr>
		<author>
			<persName><forename type="first">R</forename><surname>Desislavov</surname></persName>
		</author>
		<author>
			<persName><forename type="first">F</forename><surname>Martínez-Plumed</surname></persName>
		</author>
		<author>
			<persName><forename type="first">J</forename><surname>Hernández-Orallo</surname></persName>
		</author>
		<idno type="arXiv">arXiv:2109.05472</idno>
		<title level="m">Compute and energy consumption trends in deep learning inference</title>
				<imprint>
			<date type="published" when="2021">2021</date>
		</imprint>
	</monogr>
	<note type="report_type">arXiv preprint</note>
</biblStruct>

<biblStruct xml:id="b23">
	<monogr>
		<author>
			<persName><forename type="first">O</forename><surname>Sharir</surname></persName>
		</author>
		<author>
			<persName><forename type="first">B</forename><surname>Peleg</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Y</forename><surname>Shoham</surname></persName>
		</author>
		<idno type="arXiv">arXiv:2004.08900</idno>
		<title level="m">The cost of training NLP models: A concise overview</title>
				<imprint>
			<date type="published" when="2020">2020</date>
		</imprint>
	</monogr>
	<note type="report_type">arXiv preprint</note>
</biblStruct>

<biblStruct xml:id="b24">
	<monogr>
		<author>
			<persName><forename type="first">B</forename><surname>Dickson</surname></persName>
		</author>
		<ptr target="https://bdtechtalks.com/2020/09/21/gpt-3-economy-business-model/" />
		<title level="m">The GPT-3 economy</title>
				<imprint>
			<date type="published" when="2020">2020</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b25">
	<monogr>
		<author>
			<persName><forename type="first">D</forename><surname>Steinberg</surname></persName>
		</author>
		<ptr target="http://info.salford-systems.com/blog/bid/299181/How-Much-Time-Needs-to-be-Spent-Preparing-Data-for-Analysis" />
		<title level="m">How much time needs to be spent preparing data for analysis?</title>
				<imprint>
			<date type="published" when="2013">2013</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b26">
	<analytic>
		<title level="a" type="main">Towards automatic data format transformations: Data wrangling at scale</title>
		<author>
			<persName><forename type="first">A</forename><surname>Bogatu</surname></persName>
		</author>
		<author>
			<persName><forename type="first">N</forename><forename type="middle">W</forename><surname>Paton</surname></persName>
		</author>
		<author>
			<persName><forename type="first">A</forename><forename type="middle">A</forename><surname>Fernandes</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="m">British International Conference on Databases</title>
				<imprint>
			<publisher>Springer</publisher>
			<date type="published" when="2017">2017</date>
			<biblScope unit="page" from="36" to="48" />
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b27">
	<analytic>
		<title level="a" type="main">Inductive programming meets the real world</title>
		<author>
			<persName><forename type="first">S</forename><surname>Gulwani</surname></persName>
		</author>
		<author>
			<persName><forename type="first">J</forename><surname>Hernández-Orallo</surname></persName>
		</author>
		<author>
			<persName><forename type="first">E</forename><surname>Kitzelmann</surname></persName>
		</author>
		<author>
			<persName><forename type="first">S</forename><forename type="middle">H</forename><surname>Muggleton</surname></persName>
		</author>
		<author>
			<persName><forename type="first">U</forename><surname>Schmid</surname></persName>
		</author>
		<author>
			<persName><forename type="first">B</forename><surname>Zorn</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="j">Communications of the ACM</title>
		<imprint>
			<biblScope unit="volume">58</biblScope>
			<biblScope unit="page" from="90" to="99" />
			<date type="published" when="2015">2015</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b28">
	<analytic>
		<title level="a" type="main">Automated data transformation with inductive programming and dynamic background knowledge</title>
		<author>
			<persName><forename type="first">L</forename><surname>Contreras-Ochando</surname></persName>
		</author>
		<author>
			<persName><forename type="first">C</forename><surname>Ferri</surname></persName>
		</author>
		<author>
			<persName><forename type="first">J</forename><surname>Hernández-Orallo</surname></persName>
		</author>
		<author>
			<persName><forename type="first">F</forename><surname>Martínez-Plumed</surname></persName>
		</author>
		<author>
			<persName><forename type="first">M</forename><forename type="middle">J</forename><surname>Ramírez-Quintana</surname></persName>
		</author>
		<author>
			<persName><forename type="first">S</forename><surname>Katayama</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="m">Joint European Conference on Machine Learning and Knowledge Discovery in Databases</title>
				<imprint>
			<publisher>Springer</publisher>
			<date type="published" when="2019">2019</date>
			<biblScope unit="page" from="735" to="751" />
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b29">
	<analytic>
		<title level="a" type="main">Wrangler: Interactive visual specification of data transformation scripts</title>
		<author>
			<persName><forename type="first">S</forename><surname>Kandel</surname></persName>
		</author>
		<author>
			<persName><forename type="first">A</forename><surname>Paepcke</surname></persName>
		</author>
		<author>
			<persName><forename type="first">J</forename><surname>Hellerstein</surname></persName>
		</author>
		<author>
			<persName><forename type="first">J</forename><surname>Heer</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="m">Proceedings of the SIGCHI Conference on Human Factors in Computing Systems</title>
				<meeting>the SIGCHI Conference on Human Factors in Computing Systems</meeting>
		<imprint>
			<date type="published" when="2011">2011</date>
			<biblScope unit="page" from="3363" to="3372" />
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b30">
	<analytic>
		<title level="a" type="main">Automating data science</title>
		<author>
			<persName><forename type="first">T</forename><surname>De Bie</surname></persName>
		</author>
		<author>
			<persName><forename type="first">L</forename><surname>De Raedt</surname></persName>
		</author>
		<author>
			<persName><forename type="first">J</forename><surname>Hernández-Orallo</surname></persName>
		</author>
		<author>
			<persName><forename type="first">H</forename><forename type="middle">H</forename><surname>Hoos</surname></persName>
		</author>
		<author>
			<persName><forename type="first">P</forename><surname>Smyth</surname></persName>
		</author>
		<author>
			<persName><forename type="first">C</forename><forename type="middle">K I</forename><surname>Williams</surname></persName>
		</author>
		<idno type="DOI">10.1145/3495256</idno>
	</analytic>
	<monogr>
		<title level="j">Communications of the ACM</title>
		<imprint>
			<biblScope unit="volume">65</biblScope>
			<biblScope unit="page" from="76" to="87" />
			<date type="published" when="2022">2022</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b31">
	<monogr>
		<author>
			<persName><forename type="first">R</forename><surname>Puri</surname></persName>
		</author>
		<author>
			<persName><forename type="first">B</forename><surname>Catanzaro</surname></persName>
		</author>
		<idno type="arXiv">arXiv:1912.10165</idno>
		<title level="m">Zero-shot text classification with generative language models</title>
				<imprint>
			<date type="published" when="2019">2019</date>
		</imprint>
	</monogr>
	<note type="report_type">arXiv preprint</note>
</biblStruct>

<biblStruct xml:id="b32">
	<monogr>
		<title level="m" type="main">Exploiting cloze questions for few shot text classification and natural language inference</title>
		<author>
			<persName><forename type="first">T</forename><surname>Schick</surname></persName>
		</author>
		<author>
			<persName><forename type="first">H</forename><surname>Schütze</surname></persName>
		</author>
		<idno type="arXiv">arXiv:2001.07676</idno>
		<imprint>
			<date type="published" when="2020">2020</date>
		</imprint>
	</monogr>
	<note type="report_type">arXiv preprint</note>
</biblStruct>

<biblStruct xml:id="b33">
	<analytic>
		<title level="a" type="main">Can language models automate data wrangling?</title>
		<author>
			<persName><forename type="first">G</forename><surname>Jaimovitch-López</surname></persName>
		</author>
		<author>
			<persName><forename type="first">C</forename><surname>Ferri</surname></persName>
		</author>
		<author>
			<persName><forename type="first">J</forename><surname>Hernández-Orallo</surname></persName>
		</author>
		<author>
			<persName><forename type="first">F</forename><surname>Martínez-Plumed</surname></persName>
		</author>
		<author>
			<persName><forename type="first">M</forename><forename type="middle">J</forename><surname>Ramírez-Quintana</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="m">Workshop on Automating Data Science at ECML-PKDD</title>
				<imprint>
			<date type="published" when="2021">2021</date>
			<biblScope unit="page">13</biblScope>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b34">
	<monogr>
		<author>
			<persName><forename type="first">B</forename><surname>Zoph</surname></persName>
		</author>
		<author>
			<persName><forename type="first">I</forename><surname>Bello</surname></persName>
		</author>
		<author>
			<persName><forename type="first">S</forename><surname>Kumar</surname></persName>
		</author>
		<author>
			<persName><forename type="first">N</forename><surname>Du</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Y</forename><surname>Huang</surname></persName>
		</author>
		<author>
			<persName><forename type="first">J</forename><surname>Dean</surname></persName>
		</author>
		<author>
			<persName><forename type="first">N</forename><surname>Shazeer</surname></persName>
		</author>
		<author>
			<persName><forename type="first">W</forename><surname>Fedus</surname></persName>
		</author>
		<idno type="arXiv">arXiv:2202.08906</idno>
		<title level="m">Designing effective sparse expert models</title>
				<imprint>
			<date type="published" when="2022">2022</date>
		</imprint>
	</monogr>
	<note type="report_type">arXiv preprint</note>
</biblStruct>

<biblStruct xml:id="b35">
	<analytic>
		<title level="a" type="main">Random forests</title>
		<author>
			<persName><forename type="first">L</forename><surname>Breiman</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="j">Machine Learning</title>
		<imprint>
			<biblScope unit="volume">45</biblScope>
			<biblScope unit="page" from="5" to="32" />
			<date type="published" when="2001">2001</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b36">
	<analytic>
		<title level="a" type="main">A New Vector Partition of the Probability Score</title>
		<author>
			<persName><forename type="first">A</forename><surname>Murphy</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="j">Journal of Applied Meteorology</title>
		<imprint>
			<biblScope unit="volume">12</biblScope>
			<biblScope unit="page" from="595" to="600" />
			<date type="published" when="1973">1973</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b37">
	<analytic>
		<title level="a" type="main">On classification, ranking, and probability estimation</title>
		<author>
			<persName><forename type="first">P</forename><surname>Flach</surname></persName>
		</author>
		<author>
			<persName><forename type="first">E</forename><surname>Matsubara</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="m">Dagstuhl Seminar Proceedings</title>
				<imprint>
			<publisher>Schloss Dagstuhl-Leibniz-Zentrum für Informatik</publisher>
			<date type="published" when="2008">2008</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b38">
	<analytic>
		<title level="a" type="main">A unified view of performance metrics: Translating threshold choice into expected classification loss</title>
		<author>
			<persName><forename type="first">J</forename><surname>Hernández-Orallo</surname></persName>
		</author>
		<author>
			<persName><forename type="first">P</forename><surname>Flach</surname></persName>
		</author>
		<author>
			<persName><forename type="first">C</forename><surname>Ferri</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="j">Journal of Machine Learning Research</title>
		<imprint>
			<biblScope unit="volume">13</biblScope>
			<biblScope unit="page" from="2813" to="2869" />
			<date type="published" when="2012">2012</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b39">
	<analytic>
		<title level="a" type="main">Calibration of machine learning models</title>
		<author>
			<persName><forename type="first">A</forename><surname>Bella</surname></persName>
		</author>
		<author>
			<persName><forename type="first">C</forename><surname>Ferri</surname></persName>
		</author>
		<author>
			<persName><forename type="first">J</forename><surname>Hernández-Orallo</surname></persName>
		</author>
		<author>
			<persName><forename type="first">M</forename><forename type="middle">J</forename><surname>Ramírez-Quintana</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="m">Handbook of Research on Machine Learning Applications and Trends: Algorithms, Methods, and Techniques</title>
				<imprint>
			<publisher>IGI Global</publisher>
			<date type="published" when="2010">2010</date>
			<biblScope unit="page" from="128" to="146" />
		</imprint>
	</monogr>
</biblStruct>

				</listBibl>
			</div>
		</back>
	</text>
</TEI>
