A Framework for Categorising AI Evaluation Instruments

Anthony G Cohn (1), José Hernández-Orallo (2), Julius Sechang Mboli (3), Yael Moros-Daval (2), Zhiliang Xiang (4) and Lexin Zhou (2)
(1) School of Computing, University of Leeds, UK; and the Turing Institute, UK
(2) VRAIN, Universitat Politècnica de València, Spain
(3) Faculty of Engineering and Informatics, University of Bradford, UK
(4) IROHMS, School of Computer Science and Informatics, Cardiff University, UK

Abstract
The current and future capabilities of Artificial Intelligence (AI) are typically assessed with an ever increasing number of benchmarks, competitions, tests and evaluation standards, which are meant to work as AI evaluation instruments (EIs). These EIs are not only increasing in number, but also in complexity and diversity, making it hard to understand this evaluation landscape in a meaningful way. In this paper we present an approach for categorising EIs using a set of 18 facets, accompanied by a rubric that allows anyone to apply the framework to any existing or new EI. We apply the rubric to 23 EIs in different domains through a team of raters, and analyse how consistent the rubric is and how well it works to distinguish between EIs and map the evaluation landscape in AI.

Keywords
Evaluation Instruments, Comparison of Evaluation Instruments, Categorisation of Evaluation Instruments, Artificial Intelligence Evaluation, Future of Skills

1. Introduction

Ever since researchers started building AI systems, they have wanted to evaluate them, either against human benchmarks (such as playing human experts at Chess or other games) and/or against other AI systems. Finding good benchmarks for evaluating systems, and conducting tests, is harder than it might seem, particularly since we believe we have good methods for evaluating human intelligence, via standard tests and examinations.

There have been many tests proposed for evaluating AI systems. Probably the most famous of these is the Turing Test [1]. There have been various Turing Test competitions, of which the best known is the annual Loebner Prize competition; the results have been sometimes entertaining, and a way of promulgating ideas about AI to the general public, but it is hard to argue that any real important progress in AI has been demonstrated by the entrants. In fact, Turing himself never proposed the test as a serious way of measuring AI systems or of measuring progress, as Shieber [2] observes, adding that it is "misguided and inappropriate" ([3, 4]). Instead he argues for new "inducement prize" contests. According to Shieber, these are "award programs established to induce people to solve a problem of importance by directly rewarding the solver". Perhaps the most famous historical examples are the Longitude Rewards offered by the UK government in 1714. A current example is the $5M IBM Watson AI XPRIZE, which "challenges teams to demonstrate how humans can work with AI to tackle global challenges". Further discussion on the use of competitions, benchmarks and datasets in evaluating AI systems can be found in [5].

The situation today is that there are thousands of challenges in almost all areas of AI. They are increasing in complexity and diversity, as AI techniques likewise evolve. Because of this, it is hard to analyse this evaluation landscape in a meaningful way. Motivated by this need, we present and discuss an approach to categorising benchmarks, competitions, tests and evaluation standards, jointly referred to as AI evaluation instruments (EIs). We do this categorisation via a set of 18 facets, which we believe will be valuable in distinguishing and evaluating different proposals for evaluating AI systems. These facets, and an accompanying rubric to facilitate choosing appropriate values, are described in Section 2.

We will classify EIs using the facets in order to (a) evaluate how well the facets work in general and (b) see to what extent they help map the landscape of EIs and distinguish their differences. This may help inform how much we can translate from the facet values to guide the design of future EIs. We do not imagine there can be a single universal evaluation instrument, or even a battery for each domain (vision, reasoning, etc.); certainly that ideal has eluded the community so far. We do not even aspire to find facet values that are valid for all EIs, but our proposed work may help in directing future efforts in the evaluation of AI systems.

Since it is infeasible in a reasonable amount of time to apply this categorisation to the thousands of EIs in the literature, here we cover 23 EIs (see Table 2). By evaluating a reasonable number of carefully chosen examples, we hope to give a fair picture of the extent to which the aspects of AI appraised by the facets are being tested in the selected examples. Beyond the insights that we extract from this selected set of EIs, this paper and the rubric we have developed for the different facets should serve as a reference for third parties (e.g., other researchers) to analyse other EIs.

The rest of the paper is organised as follows. Section 2 presents the 18 facets and a rubric which explains how facet values should be chosen. Next, in Section 3, we discuss the criteria for selecting the 23 EIs and the methodology the raters used to apply the rubric. Section 4 discusses the level of disagreement between raters for each facet and EI, and how the methodology and the number of raters were adapted based on these observations. Section 5 analyses the ratings of the 23 EIs and what they reveal about this group of EIs. Finally, Section 6 closes with some general discussion and possible future work.

2. Characterising AI Evaluation Instruments

We looked for existing features or dimensions to characterise EIs, but unfortunately we did not find any systematic account in AI, other than concepts such as reproducibility, realism, coverage and specificity, usually referred to with other names and applied to a single EI. We found more dimensions and a more systematic coverage of evaluation instruments in the area of psychological testing. As a result, we have introduced a new set of facets; when possible, the terminology is based on common use in AI, but it also incorporates terms and concepts from the Standards for Educational and Psychological Testing by the American Educational Research Association [6].

The following list proposes 18 facets to characterise existing and future EIs for AI. Each facet has both a name and a two-letter acronym, whose initial letter is V, C or F, the reason for which will become clear later. Each facet is followed by its options in brackets. Some options indicate '(specify)', which means that the rater must indicate a (freetext) value for that option. The full descriptions of the facets usually include some examples and further clarifications; the latest version of the rubric can be found at https://tinyurl.com/mr2bv5hb. Here we only include the basic definition of each of them. We use colours (blue and black) that are indicative, with blue referring to the preferred or most challenging case, in general. However, for some facets a blue value may make no sense, or we do not believe that one value is 'better' than any other, so these facets have no coloured facet value(s).

• Vp - Purpose [RESEARCH, CONFORMITY, OTHER (specify)]: Is the benchmark meant to foster research or development, or to certify whether an AI system conforms with some level or standard?
• Vc - Capability [TASK-PERFORMANCE (specify), CAPABILITY (specify)]: Does the EI just measure observed (aggregated) performance on a TASK (e.g., protein folding, credit scoring) or is the EI designed to also measure a CAPABILITY (e.g., object permanence, dealing with negation)?
• Vf - Reference [ABSOLUTE, RELATIVE (specify)]: Are results reported as an absolute metric (criterion-referenced) or are they reported as a relative (percentage) metric to a reference (norm-referenced), e.g., human performance?
• Vo - Coverage [BIASED (specify), REPRESENTATIVE]: Does the EI cover a BIASED or unbiased (REPRESENTATIVE) distribution of what is meant to be measured?
• Vs - Specificity [SPECIFIC, CONTAMINATED]: Are the results precisely aligned with what is meant to be measured or contaminated by other skills or tasks?
• Vl - Realism [TOY, GAMIFIED, REALISTIC, REAL-LIFE]: To what extent is the EI a toy problem, a complex gamified problem, a realistic setting (but still in a simulated scenario, a lab or a testing facility), or is the evaluation itself happening in real life? (REAL-LIFE does not mean a final or specific product in operation. It can also happen in very early stages of research, such as evaluating prototype chatbots in a real social network.)
• Cj - Judgeability [MANUAL, AUTOMATED, MIXED]: Is scoring manual (e.g., through human questionnaires or judges), automated (e.g., correct answers or an optimality function), or a mixture?
• Cc - Containedness [FULLY-CONTAINED, PARTIAL-INTERFERENCE (specify), NOT-CONTAINED (specify)]: Once started, is the testing isolated from external factors or interference possibly having an effect on the results (human participants, online data, weather, etc.); is there some partial interference not affecting the results significantly; or is it dependent on external resources and conditions?
• Cp - Reproducibility [NON-REPRODUCIBLE, STOCHASTIC, EXACT]: Is the evaluation non-reproducible, with results biased or spoiled if repeated; does the EI have stochastic components leading to different interactions; or are the results completely reproducible, i.e., can exactly the same test (inputs, interaction, etc.) be generated again for another (or the same) competitor?
• Cl - Reliability [RELIABLE, NON-RELIABLE, N/A]: Does the evaluation present sufficient repetitions, episode length or number of instances to give low variance for the same subject when applied again (test-retest reliability)? If the testing methodology or the common use of the EI is not clear then N/A may be the most appropriate facet value.
• Cv - Variation [FIXED, ALTERED, PROCEDURAL]: Is the evaluation based on fixed datasets; have the instances been altered by adding post-processing variations (noise, rotations, etc.); or have they been created (e.g., using procedural generation)? (Although we have coloured PROCEDURAL, we recognise that procedural may not always be better and can lead to problems if variations are not in an appropriate proportion. Also, generated data may just lead to a learning algorithm reverse-engineering the generator.)
• Ca - Adjustability [UNSTRUCTURED, ABLATABLE, ADAPTIVE]: Is the analysis of results on the set of instances unstructured; has the EI identified a set of meta-features, such as difficulty or dimension, that could be used to analyse the results by these dimensions (ablatable); or are these meta-features used to adaptively or adversarially choose the instances to test more informatively (adaptive)?
• Fn - Antecedents [CREATED, RETROFITTED (specify)]: Is it devised on purpose for AI or adapted from tests designed to test humans?
• Fm - Ambition [SHORT, LONG]: When the EI was created, was it aiming at the short term (improving on the SOTA) or the long term (more ambitious goals)?
• Fp - Partiality [PARTIAL (specify), IMPARTIAL]: Does the EI favour particular technologies, conditions or cultures that should not have an influence on the result of the evaluation? (Vo-Coverage is about the domain, whilst Fp-Partiality is about how the EI may favour some test-takers over others.)
• Fo - Objectivity [LOOSE, CUSTOMISED, FULLY-INDEPENDENT]: Is it loosely defined, customised to each participant, or does the EI have a predetermined independent specification? (LOOSE refers to cases when the evaluation is very open, e.g., a robotic-domain EI where we evaluate a satisfactory interaction with the user, but not even a clear questionnaire is defined. FULLY-INDEPENDENT could treat different groups differently if there is a reason, for equality of treatment.)
• Fr - Progression [STATIC, DEVELOPMENTAL]: Is the score measuring a capability at one particular moment or is it evaluating the development of the capability of the system within the test?
• Fu - Autonomy [AUTONOMOUS, COUPLED (specify), COMPONENT]: Is it measuring an autonomous system, a system coupled with other systems (e.g., humans), or a component?

The facets above can be grouped into three main categories following the three main groups given by the Standards for Educational and Psychological Testing [6]: validity, reliability/precision and fairness. We use these three major groups to give some structure to the facets above. Roughly, these groups deal with what is measured, how it is measured and who is measured, respectively.

• Validity group (Does it measure what we want to measure?): Vp, Vc, Vf, Vo, Vs, Vl
• Consistency (Reliability/Precision) group (Does it measure it effectively and verifiably?): Cj, Cc, Cp, Cl, Cv, Ca
• Fairness group (Does it treat all test takers equally?): Fn, Fm, Fp, Fo, Fr, Fu

Some of these are closely related, such as {Cv, Ca, Vo} or {Fo, Cp}. The term accommodation in [6] is "used to denote changes with which the comparability of scores is retained, and the term modification is used to denote changes that affect the construct measured by the test". This is related to Vs, Cv, Fo and Cc, and also to the term "measurement invariance", which is very important here to see if accommodations of the same test could evaluate the same construct for different AI systems and even humans.
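To make the structure of the rubric concrete, the sketch below encodes the 18 facets and their options as a small data structure and checks a hypothetical rating against it. This is our illustrative rendering, not part of the official rubric: the facet acronyms and options come from the list above, while the validation logic, helper names and the example rating are assumptions.

```python
# Illustrative sketch (not the official rubric): the 18 facets and their options
# encoded as data, plus a minimal well-formedness check for a single rating.

FACETS = {
    # Validity group: what is measured
    "Vp": ["RESEARCH", "CONFORMITY", "OTHER"],
    "Vc": ["TASK-PERFORMANCE", "CAPABILITY"],
    "Vf": ["ABSOLUTE", "RELATIVE"],
    "Vo": ["BIASED", "REPRESENTATIVE"],
    "Vs": ["SPECIFIC", "CONTAMINATED"],
    "Vl": ["TOY", "GAMIFIED", "REALISTIC", "REAL-LIFE"],
    # Consistency (reliability/precision) group: how it is measured
    "Cj": ["MANUAL", "AUTOMATED", "MIXED"],
    "Cc": ["FULLY-CONTAINED", "PARTIAL-INTERFERENCE", "NOT-CONTAINED"],
    "Cp": ["NON-REPRODUCIBLE", "STOCHASTIC", "EXACT"],
    "Cl": ["RELIABLE", "NON-RELIABLE", "N/A"],
    "Cv": ["FIXED", "ALTERED", "PROCEDURAL"],
    "Ca": ["UNSTRUCTURED", "ABLATABLE", "ADAPTIVE"],
    # Fairness group: who is measured
    "Fn": ["CREATED", "RETROFITTED"],
    "Fm": ["SHORT", "LONG"],
    "Fp": ["PARTIAL", "IMPARTIAL"],
    "Fo": ["LOOSE", "CUSTOMISED", "FULLY-INDEPENDENT"],
    "Fr": ["STATIC", "DEVELOPMENTAL"],
    "Fu": ["AUTONOMOUS", "COUPLED", "COMPONENT"],
}

# Options marked '(specify)' in the rubric require a free-text note from the rater.
REQUIRES_SPECIFY = {
    ("Vp", "OTHER"), ("Vc", "TASK-PERFORMANCE"), ("Vc", "CAPABILITY"),
    ("Vf", "RELATIVE"), ("Vo", "BIASED"), ("Cc", "PARTIAL-INTERFERENCE"),
    ("Cc", "NOT-CONTAINED"), ("Fn", "RETROFITTED"), ("Fp", "PARTIAL"),
    ("Fu", "COUPLED"),
}

def check_rating(rating: dict) -> list:
    """Return the problems in a rating: missing facets, unknown options,
    and '(specify)' options left without a free-text note."""
    problems = []
    for facet, options in FACETS.items():
        if facet not in rating:
            problems.append(f"missing facet {facet}")
            continue
        value, note = rating[facet]
        if value not in options:
            problems.append(f"{facet}: unknown option {value!r}")
        elif (facet, value) in REQUIRES_SPECIFY and not note:
            problems.append(f"{facet}: option {value!r} requires a '(specify)' note")
    return problems

if __name__ == "__main__":
    # Hypothetical rating (values invented, defaulting to the first option).
    rating = {f: (FACETS[f][0], "example note") for f in FACETS}
    rating["Vp"] = ("RESEARCH", "")
    rating["Vl"] = ("REALISTIC", "")
    print(check_rating(rating))   # -> [] if the rating is well formed
```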
3. EI Selection and Rating Methodology

Now that the facets and the rubric have been explained, we proceed to discuss how the EIs were selected, what the final selection was, and what protocol we followed in assigning EIs to the raters.

3.1. EI Selection

We considered evaluation instruments with the following criteria for inclusion:

• Potential interest to understand the future of AI skills: An EI might be regarded as being of interest if systems which perform well on it can be regarded as indicating a noteworthy change in the capabilities of AI in general. In other words, progress on this EI requires significant enhancement of AI techniques beyond the specific requirements of the EI.
• Diversity in the kind of task: We tried to cover a variety of domains, formats and types of problems (vision, natural language, competitions, datasets, supervised, etc.).
• Popularity: How many teams have already used this EI? How many published papers refer to it? We can use proxies for this, such as citations to the original papers introducing the EI, or the number of results on websites such as paperswithcode.com. We also have to consider that industry-related EIs may be less popular than research-oriented EIs. However, given the number of EIs selected, we repeat domains and cover just a few areas (e.g., NLP, vision, robotics) without being comprehensive for all possible domains.
• Currency: We prefer EIs still in active use or recently introduced, rather than those which have fallen out of use.

The sources of the EIs were mostly repositories (e.g., http://paperswithcode.com, http://kaggle.com, https://zenodo.org/record/4647824#.YV7CPdrMKUk, https://www.eff.org/ai/metrics, https://en.wikipedia.org/wiki/List_of_datasets_for_machine-learning_research, http://www.chalearn.org) and surveys, institutions such as NIST (https://www.nist.gov/programs-projects/ai-measurement-and-evaluation) and LNE (https://www.lne.fr/en/testing/evaluation-artificial-intelligence-systems), and competitions at AI conferences. Then, we identified possible gaps in terms of domains, or whether we expected that the answers for some facets were going to be too similar. We also considered whether we would expect to get diversity in the blue values for the facets, so that we get different levels of quality according to this colour code. Note that at the time of selection we could of course only roughly estimate how many blue categories we might get for each EI. Since we expected to learn more about the categorising of EIs as categorisation proceeded, we did not choose all EIs in advance but selected them incrementally. The 23 selected EIs are shown in Table 2.

These EIs cover a good distribution of benchmarks, competitions and datasets, although some of them can be considered to be in two of these categories. The term 'test' to refer to an EI is less usual. About half of the 23 EIs require the use of language in the inputs and/or outputs, and about one half of them require some kind of perception (mostly computer vision), with some overlap between these two groups. Only a few of the EIs are related to navigation and robotics, in virtual (e.g., video games) or physical environments, and a small number are related to more abstract capabilities or to problems related to planning or optimisation.

3.2. Rating Methodology

We devised a protocol to refine and validate the rubric, but also to cover as many EIs as possible, according to the number of raters we had available. We explain the protocol below, but we note that this protocol can be adapted to other situations or can incorporate ideas from consensus-based ratings or the Delphi method [30]. First, two of the authors of this paper (A.C. and J.H-O.) acted as coordinators for the rating process. A total of four raters were chosen. Raters were AI-related undergraduate and graduate students, and were recruited through a selection process and interviews. They are the other four authors of this paper (J-S.M., Y.M-D., Z.X. and L.Z.). Once the raters were appointed, each rater was given some meta-information about each EI (acronym, name, major sources, what it measures, etc.) and had to complete some other general information about each EI. They were also asked for some information about their own completion, such as the time taken (in hours).

We established three batches, covering 2, 11 and 10 EIs respectively, in the order they are presented in Table 2. The first two EIs had already been used by the coordinators in developing the list of facets and their values. All the raters started off on these two EIs too and were given feedback on their chosen values before proceeding to any further EIs. We refer to these two EIs as "Batch 1". The next 11 EIs are referred to as "Batch 2". These two batches were rated by all four raters, independently. After the analysis of consistency we deemed it sufficient to have only two raters per EI. Then, a final set of 10 EIs, referred to as "Batch 3", were each rated by just two raters, for reasons of economy, since we already had reasonable inter-rater consistency after the end of Batches 1 and 2. The two raters for each EI were assigned so that all raters would have five EIs and, across their five EIs, they co-rated with all the other three raters (i.e., one EI with one other rater and two EIs with each of the other raters); a small sketch of such an assignment is given below. In this first stage, they worked independently, not sharing values for any of the facets, and only reporting questions and partial results to the coordinators.

There were some changes to the rubric between batches, especially clarifying the description of some of the facets and, in a few cases, changing the number and/or name of the options. Whenever a change was introduced, the raters were informed and had to revisit their ratings for previous batches.

In a second and final stage of the process, the coordinators allowed the raters to exchange opinions, but they were not asked to reach a consensus, just to identify possible misunderstandings. From this discussion, a few ratings were modified. Unless explicitly stated, we refer to these final ratings in the rest of the paper.
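The pairing constraint for Batch 3 (each rater covers five of the ten EIs, co-rating one EI with one colleague and two EIs with each of the other two) admits a simple explicit assignment. The sketch below, with hypothetical rater labels, constructs one such assignment and verifies the constraint; it only illustrates the scheme and is not the actual allocation used in the study.

```python
from itertools import combinations
from collections import Counter

# Hypothetical labels for the four raters; the ten Batch-3 EIs are implicit (one per pair entry).
raters = ["R1", "R2", "R3", "R4"]

# One assignment satisfying the scheme in the text: two rater pairs co-rate a
# single EI, the remaining four pairs co-rate two EIs each (2*1 + 4*2 = 10 EIs).
pair_for_ei = [
    ("R1", "R2"),                      # 1 EI
    ("R3", "R4"),                      # 1 EI
    ("R1", "R3"), ("R1", "R3"),        # 2 EIs
    ("R1", "R4"), ("R1", "R4"),        # 2 EIs
    ("R2", "R3"), ("R2", "R3"),        # 2 EIs
    ("R2", "R4"), ("R2", "R4"),        # 2 EIs
]

# Check: every rater rates exactly five EIs and co-rates with all three colleagues.
load = Counter(r for pair in pair_for_ei for r in pair)
assert all(load[r] == 5 for r in raters)
assert {frozenset(p) for p in pair_for_ei} == {frozenset(p) for p in combinations(raters, 2)}
print(dict(load))   # {'R1': 5, 'R2': 5, 'R3': 5, 'R4': 5}
```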
Table 2
EIs given to raters and included in our analysis.

Acronym | Type | Domain | Aim | Year
WSC [7] | test, benchmark & competition | LU, CS, reasoning | Specifically targeted to evaluate common sense reasoning, as an alternative to the Turing Test, arguing conceptual and practical advantages. | 2016
ALE [8] | benchmark | VG; navigation; perception | The original goal was to evaluate "general, domain-independent AI technology" by using a diversity of video games, although what it measures more specifically is unclear. | 2013
GLUE [9] | benchmark | LU; text retrieval; world knowledge | The goal of GLUE and SuperGLUE (an improved/modified version of GLUE) is to measure the performance (e.g., accuracy, F1-score) of an AI system in natural language understanding tasks (single-sentence tasks, similarity and paraphrase tasks, and inference tasks) in English. | 2018
SUPERGLUE [10] | benchmark | LU; text retrieval; world knowledge | The goal of GLUE and SuperGLUE (an improved/modified version of GLUE) is to measure the performance (e.g., accuracy, F1-score) of an AI system in natural language understanding tasks (single-sentence tasks, similarity and paraphrase tasks, and inference tasks) in English. | 2019
IMAGENET [11] | competition | image classification; object recognition; object localisation | Aims to measure the visual recognition capability for object recognition, image classification and object localisation. The images can contain different numbers of objects (e.g., mammal, bird, fish, vehicle, furniture, tool, flower, fruit, etc.), occlusions and clutter (i.e., diversity and noise). | 2010
AIBIRDS [12] | competition | CV, VG, KRRP | Measures the planning capability of an agent in a large action space, without knowledge of the physical parameters of objects, in the situation given by Angry Birds. | 2010
ICCMA [13] | competition | reasoning; AA, CL | Aims to measure/compare the performance of different solvers regarding argumentation (particularly, reasoning problems that require logic). | 2015
RoboCup SPL [14] | competition | RCRPVMASS | Aims to measure and promote improvements in multi-robot (humanoid) systems by playing soccer matches with robots. | 1998
RoboCup@Home [15] | competition | HRIC, NMDE, CV, ABP | Aims to measure the performance of the developed AI robots in providing service with assistive robot technology with high relevance for future personal domestic applications. | 2006
Librispeech-SL12 [16] | dataset | speech recognition | Aims to provide a freely available read speech corpus in English that is suitable for training and testing speech recognition systems. | 2015
GVGAI [17] | competition | VG; general AI; PN | Aimed at systems that can perform well in multiple video games, possibly without knowing the game in advance and with little to no specific domain knowledge, as an approximation to artificial general intelligence. | 2014
PIQA [18] | benchmark, dataset | PCU, NLP, reasoning | Aims to measure physical interaction reasoning about both the prototypical use of objects (e.g., shoes are used for walking) and non-prototypical but practically plausible use of objects (e.g., shoes can be used as a doorstop). It targets language representations of knowledge traditionally only seen or experienced. | 2019
SAT [19] | competition | boolean satisfiability | Aims to keep progress and further improve the performance and robustness of SAT solvers, with a history dating back to the early 90s, thanks to the persistent efforts of the SAT community. | 2002
VCR [20] | dataset | CR; cognition; VR | Aims to measure the ability to infer what is happening in a picture (people's actions, goals, etc.) from visual signs which are obvious for humans. | 2019
Assembly [21] | competition | RM, ARH, MPLT, DiHM, RGVELO, anthropomorphic hand | Identifying key competencies and characteristics of robotic systems using a robust set of formalised evaluations and benchmarks, to help match robotic hand capabilities to end-user needs as well as to provide developers and researchers insight for improving their hardware and software designs. | 2017
IMDb [22] | dataset | NLP | Detecting the sentiment of a piece of text. | 2011
SocialIQA [23] | benchmark | SI, SIn, EI, IR | Aims to measure the social and emotional intelligence of computational models through multiple-choice question answering. | 2019
GGP [24] | competition | game playing | General game playing (GGP) is the design of artificial intelligence programs able to play more than one game successfully. | 2005
SQUAD2.0 [25] | dataset | reading comprehension; NLP | Aims to measure reading comprehension abilities that allow a system to get a correct answer to a given question when the solution can be extracted from the text, or to abstain from answering otherwise. | 2018
WikiQA [26] | benchmark, dataset | NLP | WikiQA is a dataset for open-domain question answering. | 2014
SWAG [27] | dataset, benchmark | NLI, CR | Aims to evaluate the performance of a system in grounded commonsense inference (reasoning about a situation and anticipating what might come next) by answering multiple-choice questions. | 2018
L2RPN [28] | competition | SG, AI, PG, PN | This challenge aims at testing the potential of AI to address this important real-world problem (running a power network) for our future. | 2012
Lifelong-Robots [29] | competition | robotics, CV, RV | Provides a robotic vision dataset collected from real-time environments to accelerate both research and applications of visual models for robotics. | 2019
Abbreviations: HRIC = Human-Robot Interaction and Cooperation; NMDE = Navigation and Mapping in Dynamic Environments; CV = Computer Vision; ABP = Adaptive Behaviours, Planning; AA = Abstract Argumentation; CL = Computational Logic; VG = Video Games; KRRP = Knowledge Representation, Reasoning, Planning; RCRPVMASS = Robotics, Cooperation, Real-time Planning, Vision, Multiagent Systems, Strategy; LU = Language Understanding; CS = Common Sense; RM = Robotics in Manufacturing; ARH = Adaptive Robot Hands; MPLT = Manipulation Planning based on Learning Techniques; DiHM = Dexterous in-Hand Manipulation; RGVELO = Robust Grasping with Various Everyday Life Objects; SI = Social Interaction; SIn = Social Intelligence; EI = Emotional Intelligence; IR = Inferential Reasoning; CR = Commonsense Reasoning; VR = Visual Recognition; PN = Planning and Navigation (GVGAI) / Power Networks (L2RPN); SG = Smart Grids; PG = Power Grids; PCU = Physical Commonsense Understanding; NLI = Natural Language Inference; RV = Robotic Vision.

4. Analysis of Rater Consistency

As noted above, the 1st and 2nd batches differ from Batch 3 because the former had four raters whilst the latter only two. Thus, in the former case, a majority agreement can be formed with three or four raters agreeing, whilst in Batch 3 only when both raters agree; hence 'majority' is less statistically significant for the 3rd batch. For simplicity, we will use round A and round B respectively when referring to the first two batches and the 3rd batch. As shown in Figure 1, the level of agreement coincides to a great extent when comparing the results from all batches (Figure 1, top) with the individual ones from round A (Figure 1, middle) and round B (Figure 1, bottom).

It can be expected that those facets with more possible values (4) might have more disagreements than those with only two possible values, simply for statistical reasons. We can see that in fact this is not having a big effect, as shown in Table 1.

Table 1
Level of agreement for the 18 facets, according to the number of options for each facet.

Level               | 2 options      | 3 options          | 4 options | Total
Consistently Agreed | Fr, Fn         | Vp, Cj, Cc, Fo, Fu | -         | 7
Moderately Agreed   | Vf, Fp         | Cp, Cv             | Vl        | 5
Often Diverged      | Vc, Vo, Vs, Fm | Cl, Ca             | -         | 6

[Figure 1: Agreements on facet value ratings for the 23 EIs and rounds A and B. Three stacked bar charts ("Facets Agreements for Both Rounds", "Facets Agreements for Round A", "Facets Agreements for Round B") show, for each facet, the counts of EIs with agreement, majority agreement and disagreement.]
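The agreement levels summarised in Figure 1 and Table 1 can be reproduced mechanically from the raw ratings. The sketch below illustrates the counting rule stated above: a facet value for an EI counts as agreed when all its raters coincide, and as a majority when more than half (but not all) do, which in round B (two raters) leaves only agreement or disagreement. The ratings and helper names are invented placeholders, not the study's data.

```python
from collections import Counter

def agreement_level(values):
    """Classify one facet of one EI from its raters' chosen options:
    'agreement' if all raters coincide, 'majority' if more than half
    (but not all) coincide, otherwise 'disagreement'."""
    top = Counter(values).most_common(1)[0][1]
    if top == len(values):
        return "agreement"
    if top > len(values) / 2:
        return "majority"
    return "disagreement"

# Invented example: Vp ratings for three EIs, two from round A (four raters)
# and one from round B (two raters).
ratings = {
    ("EI-1", "Vp"): ["RESEARCH", "RESEARCH", "RESEARCH", "RESEARCH"],   # round A
    ("EI-2", "Vp"): ["RESEARCH", "RESEARCH", "RESEARCH", "CONFORMITY"], # round A
    ("EI-3", "Vp"): ["RESEARCH", "CONFORMITY"],                         # round B
}

per_facet = Counter((facet, agreement_level(vals))
                    for (_, facet), vals in ratings.items())
print(per_facet)
# Counter({('Vp', 'agreement'): 1, ('Vp', 'majority'): 1, ('Vp', 'disagreement'): 1})
```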
The pattern of agreement or disagreement amongst the raters tends to vary depending on several factors, such as facet complexity, the available information on the EI, and so on. In particular, we observe the following:

• Fr, Fn, Vp, Cj, Cc, Fo, Fu are consistently agreed across all batches, with very few disagreements.
• Vf, Vl, Cp, Cv, Fp appear to be moderately agreed and supported by a majority (≥ 75%). Notably, Vl has the largest number of value options, but is still agreed well by a majority.
• Vo and Vs, despite having only binary options, are, together with Cl, among the least agreed.

It is not surprising that some of the facets consistently reached consensus, considering that their values tend to distribute towards one single selection (detailed in Section 5). For instance, as we will see in the following section, RESEARCH is picked for the Vp facet with only one disagreement for all rounds. This might reflect the fact that some facets do not have much variability in their chosen options. For example, most EIs are indeed proposed for the purpose of research (Vp), and given the low variability in the values there cannot be much disagreement (the variance of a Bernoulli distribution). As the variability of facets increases, choosing answers for the facets might require more EI-specific domain knowledge from the raters. For instance, to make justifiable decisions for facets like Vo and Vs, raters often needed to seek related literature for support when the answers were not clear from the specifications of the EIs. Whether an EI is specific (Vs) and general (Vo) enough for the measuring of certain capabilities is indeed hard to judge depending solely on the specifications. As such, information extracted from different sources might lead to disagreements in the selections.

Moreover, the subjectivity of a facet could also contribute to value divergences. This might be a reasonable explanation for the inconsistent selections in Vc, Ca and Fm, since they allow raters more space for subjective interpretations. While relevant information w.r.t. Vc and Fm is often stated in the EI specifications, these statements can be interpreted to different degrees or in different ways. For example, an EI for natural language understanding (NLU) could aim at improving state-of-the-art performance (short-term) or at measuring agents' capabilities regarding NLU (long-term); object recognition could be argued to be a visual capability or a specific task. Having both option variability and subjectivity made these three facets the least agreed ones. Also, some facets are related, and a disagreement in one may be accompanied by disagreement in others. For instance, when TASK-PERFORMANCE is selected for Vc, the value of the Vs facet is more likely to be SPECIFIC. As such, Vs is more likely to diverge if disagreement occurred on Vc. This might also account for the high divergence rate of facets in the Validity group.

In summary, apart from the statistical reason given by the number of values and their variability, the causes for disagreement can be grouped into three blocks:

• Similarity between facet values: The closeness or similarity between facet options might also have reduced the chance of picking the right option. For example, the facet Vl - Realism has four options (TOY, GAMIFIED, REALISTIC and REAL-LIFE), and it is not always easy to distinguish between REALISTIC and REAL-LIFE.
• Insufficient details: For many EIs, the information or details provided by the organisers of the competition, the test or the datasets in the EI are not sufficient to understand what the EI is actually measuring. Other EIs are well documented and have published articles that make it easy to obtain meta-information and the facet values for such EIs.
• Conflicting information: One of the factors that did not help is the source of information about each EI. For some EIs, there is perhaps too much information and many papers using them, and they do not always understand the same thing or use it in the same way. One paper or website might be talking about task performance while other sources talk of capabilities or both.

Overall, given these sources and levels of disagreement, as shown in Figure 1, we considered the rubric sufficiently validated to move from round A to round B with fewer raters, and for the analysis in the next section.

5. Analysis of Results

Herein, we break down the results obtained by the raters to describe what they reveal about the 23 selected EIs (Table 2). Figure 2 shows the frequencies of the different options of the 23 EIs for each of the 18 facets. The frequency is calculated differently in the first and the second round. In round A, since we have four raters, each counts for 0.25 units of frequency (if all chose the same option, it sums up to 1). In round B, we have two raters, each counting for 0.5. In total, we have a maximum frequency of 23 in each option.
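A minimal sketch of this weighting scheme follows: each of an EI's raters contributes 1/(number of raters for that EI) to the frequency of the option they chose, so every EI contributes exactly one unit per facet and the per-facet total is bounded by 23. The ratings below are invented placeholders, not the study's data.

```python
from collections import defaultdict

# Invented ratings for one facet (Vp): EI name -> options chosen by its raters.
# Round A EIs have four raters, round B EIs have two.
vp_ratings = {
    "EI-A (round A)": ["RESEARCH", "RESEARCH", "RESEARCH", "CONFORMITY"],
    "EI-B (round B)": ["RESEARCH", "RESEARCH"],
}

def option_frequencies(ratings_per_ei):
    """Each rater of an EI contributes 1/n_raters, so each EI adds one unit in total."""
    freq = defaultdict(float)
    for chosen in ratings_per_ei.values():
        weight = 1.0 / len(chosen)          # 0.25 in round A, 0.5 in round B
        for option in chosen:
            freq[option] += weight
    return dict(freq)

print(option_frequencies(vp_ratings))
# {'RESEARCH': 1.75, 'CONFORMITY': 0.25}  -> sums to 2.0, one unit per EI
```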
[Figure 2: The distribution of the options in all facets.]

Validity group (Does it measure what we want to measure?): Nearly all EIs are designed to foster RESEARCH (Vp) and use ABSOLUTE metrics (a preferred option in Vf). The number of EIs dedicated to measuring performance on a concrete task and of EIs aiming to measure a capability is similar (Vc), which suggests that the field (at least as represented by these 23 EIs) is undecided on whether to evaluate performance or capabilities. In Vo, most EIs were classified as REPRESENTATIVE. However, the percentage of BIASED EIs is still significant (circa 25%), suggesting more effort may be needed to improve the coverage of current (as well as future) EIs to mitigate or avoid unrepresentative and unreliable assessments. Surprisingly, only around half of the EIs were SPECIFIC (Vs), i.e., the other half were CONTAMINATED. All the EIs that were designed for TASK-PERFORMANCE are always SPECIFIC (this is suggested in the rubric) but, more interestingly, most EIs designed to measure a CAPABILITY are CONTAMINATED (i.e., the results do not completely align with what is meant to be measured). More effort is needed to encourage reliable and robust methodologies to evaluate the capabilities of AI systems, although we recognise that it is sometimes inevitably hard to measure certain capabilities reliably (e.g., common-sense reasoning). With regard to realism (Vl), REALISTIC EIs account for a predominant proportion (circa 80%), implying considerable focus on measuring systems solving practical problems, but with the evaluation not taking place in an actual real-life scenario; thus most EIs focus on evaluating systems in simulated scenarios or scenarios which are an abstraction of a real-world setting.

Consistency group (Does it measure it effectively and verifiably?): Nearly all EIs are FULLY-CONTAINED (Cc), implying that current EIs enjoy high independence from external factors during the assessment, and RELIABLE (Cl), which are desirable features. Regarding Cj, most EIs evaluate the systems with AUTOMATED scoring instead of MANUAL or MIXED. This can be double-edged, since automated scoring is generally more objective and faster to calculate but also requires a proper definition of the scoring. (Easy scoring gives an impression of higher objectivity but some subjectivity still exists in the choice of the metric itself. Automated scoring usually helps with repeatability and traceability.) For instance, how do we use automated scoring to evaluate whether a robotic dancer or cook is good or bad? This may be easy for some human experts but quite hard to define using a metric. Things become particularly complicated when measuring a special capability, such as common-sense reasoning. In terms of Cv, nearly all EIs are FIXED datasets. Almost none had altered the instances by adding post-processing variations or created new ones to cover a range of variations intrinsically, possibly because using fixed datasets is easier than modifying instances systematically. However, this could obstruct diversity in the evaluation methodology (e.g., sometimes it would be interesting to see how the system's performance varies by adding noise to the data to test the model's robustness). Surprisingly, most EIs are UNSTRUCTURED or ABLATABLE (Ca), but almost none are ADAPTIVE. This might be because adaptive tests are much more difficult to operate and require an understanding of what the most informative instances are.

Fairness group (Does it treat all test takers equally?): EIs that are IMPARTIAL account for 80% of the data (Fp), which seems a good indicator. However, the actual value might be even lower, since it is often hard to detect partiality. For instance, in an EI for benchmarking clinical decision support systems, the training set may only include Latin American patients while there are patients from other regions in the test set. Interestingly, virtually all the analysed EIs are classified as FULLY-INDEPENDENT (Fo), as the values CUSTOMISED and LOOSE only reach a frequency of 0.25 (i.e., these options were only chosen once). The fact that current EIs have the same predetermined specification for all assessed systems is positive and a characteristic that favours fairness in evaluation. Nearly all EIs evaluate the AI systems statically rather than developmentally (Fr), possibly because for many applications we care more about the final performance than about how the system's performance evolves. Also, it is easier to evaluate the former than the latter. However, DEVELOPMENTAL EIs could give more insights about how the models learn with variations of the input features and different curricula, detect when and why things go wrong during the training phase, and explore the trade-off between number of instances, time and performance.

In summary, regarding the validity of the EIs, we found that most of the selected EIs that measure a capability do not necessarily measure the capability reliably. Still, these failures could serve as excellent future references for developing more robust frameworks for evaluating capabilities, and more efforts are required in the years to come. Also, we still need to improve the coverage (i.e., representativeness) of the current EIs. In addition, the development of more EIs with real-life settings may encourage the development of AI systems better able to operate in real-life situations. Regarding the consistency group: albeit most of the selected EIs measure effectively and verifiably, as they are FULLY-CONTAINED and RELIABLE, there is still an evident lack of diversity in the evaluation process. For instance, we may need more EIs focusing on altering instances by adding post-processing variations or creating instances to cover a range of variations intrinsically. Also, more adaptive ways to test a system should be encouraged, in order to evaluate how the system copes in circumstances of different difficulty. Finally, in terms of fairness, the selected EIs enjoy low partiality and high objectivity. However, more effort is needed in spurring EIs to also focus on evaluating how a system performs during the development process. Furthermore, the community may need more benchmarks that focus on humans and machines working together, since only one out of the 23 EIs was done this way.

When looking at the distribution of facet values per EI, we can see that those related to robotics and the physical world (RoboCup SPL, RoboCup@Home and Lifelong-Robots) have more variability in judgeability (MANUAL becomes more frequent), realism (REALISTIC and REAL-LIFE also become more frequent) and containedness (PARTIAL-INTERFERENCE becoming more common), as well as in autonomy, with the COUPLED value being chosen for some of them. One of the most popular EIs in the history of AI, ImageNet, is the only one where the value PARTIAL is chosen by (at least) half of the raters, and also the one with all BIASED values chosen in coverage (along with LibriSpeech). The disagreement in partiality may suggest that some sources of partiality are only discovered after repeated use of an EI and are not identified by everyone immediately. GVGAI is peculiar as a well-thought-out EI, where video games are ablatable by several characteristics or by the difficulty of the game. It is also going in the direction of being procedural, but still to a limited extent as per the values assigned by the raters for this EI. Finally, those EIs related to natural language, and especially WSC, GLUE, SUPERGLUE, PIQA, SocialIQA, SQUAD2.0, WikiQA and SWAG, have high degrees of CONTAMINATED values in the facet Specificity. This might be a reflection of how difficult it is to isolate particular capabilities when using natural language, as some basic natural language competency requires many other things. This is also reflected by the recent success of language models in doing a variety of tasks [31, 32, 33, 34], since mastering natural language seems to be contaminated by so many other capabilities and skills.

6. Discussion and Conclusions

In Section 4 we have seen disagreement between CAPABILITY and TASK-PERFORMANCE (Vc), between SPECIFIC and CONTAMINATED (Vs), and between UNSTRUCTURED and ABLATABLE (Ca). The distributions of these facets in Section 5 may illustrate a difficulty in interpreting what the EI designers intended, i.e., a lack of clarity in the specification of the EI. It may also be a sign of unresolved issues in AI evaluation: going from task-oriented evaluation based on performance to more general EIs leads to SPECIFICITY problems. For instance, adding many millions of examples can help coverage but comes with problems of specificity and more difficulty in understanding the role each example plays in the overall score being measured by the EI.
Several lems of specificity and more difficulty in understanding versions of the facets were discussed in a series of meet- the role each example plays in the overall score being ings within the project, and especially two meeting in measured by the EI. July 5th 2021 and October 26th, where we presented pre- Being aware of the consistency issues of the rating liminary versions of this rubric. In particular, we thank methodology, we think the set of facets and associated the OECD team (Stuart Elliott, Abel Baret, Margarita rubric, as well as the results of the study of 23 EIs reported Kalamova, Nóra Révai, Mila Staneva) and the rest of ex- in this paper, can be useful for three different kinds of perts and participants (Guillaume Avrin, Lucy Cheke, users in slightly different ways. First, EI creators can see Kenneth D. Forbus, Yvette Graham, Patrick Kyllonen, what design choices in their EI to modify from a first eval- Elena Messina, Britta Rüschoff, Michael Schönstein, Jim uation of its facets and see how it compares to other EIs. Spohrer and Swen Ribeiro). We also thank the OECD for For AI system developers, they can choose the right EIs the funding which made this work possible as well as according to the facet values, and better understand what their encouragement. they can expect from the evaluation and what it means exactly. Finally, for policy-makers and stakeholders from academia, scientific publishing, industry, government References and other strategic organisations, an increasing number of EIs being evaluated and catalogued can serve to under- [1] A. Turing, Computing machinery and intelligence, stand the landscape of AI evaluation much better. This Mind 59 (1950) 433. can help them recognise gaps and limitations, beyond the [2] S. M. Shieber, Principles for designing an AI compe- unstructured collections of benchmark results by metric tition, or why the Turing Test fails as an inducement that have become very useful for meta-analysis but still prize, AI Magazine 37 (2016) 91–96. lacking structure and insight about the EIs themselves. [3] S. M. Shieber, Lessons from a restricted Turing Test, In fact, there have been several studies focusing on Commun. ACM 37 (1994) 70–78. numeric comparison and the evolution of performance [4] P. Hayes, K. Ford, Turing test considered harm- for a range of EIs [35, 36]. These studies see the evolution ful, in: International Joint Conference on Artificial of the progress of AI systems according to some metrics, Intelligence (IJCAI), 1995, pp. 972–977. but we need more analysis on how the evaluation in- [5] A. G. Cohn, On evaluating artificial intelligence struments (benchmarks, competitions, standards, tests, systems: Competitions and benchmarks, in: AI etc.) are also evolving, and whether they are meeting the and the Future of Skills, Volume 1 Capabilities and demands of a more comprehensive evaluation beyond Assessments, OECD, 2021, pp. 238–251. some simple metrics. This was our main motivation. [6] AERA, APA, NCME, et al., Standards for educa- We have faced some difficulties in determining the tional and psychological testing, American Educa- criteria for inclusion of EIs, the isolation of some facets tional Research Association, 2014. that were difficult to understand or confused with others, [7] H. J. 
[7] H. J. Levesque, The Winograd Schema Challenge, in: Logical Formalizations of Commonsense Reasoning, Papers from the 2011 AAAI Spring Symposium, Technical Report SS-11-06, Stanford, California, USA, AAAI, 2011. URL: http://www.aaai.org/ocs/index.php/SSS/SSS11/paper/view/2502.
[8] M. G. Bellemare, Y. Naddaf, J. Veness, M. Bowling, The arcade learning environment: An evaluation platform for general agents, J. Artif. Intell. Res. 47 (2013) 253–279. doi:10.1613/jair.3912.
[9] A. Wang, A. Singh, J. Michael, F. Hill, O. Levy, S. R. Bowman, GLUE: A multi-task benchmark and analysis platform for natural language understanding, in: 7th International Conference on Learning Representations (ICLR 2019), New Orleans, LA, USA, 2019. URL: https://openreview.net/forum?id=rJ4km2R5t7.
[10] A. Wang, Y. Pruksachatkun, N. Nangia, A. Singh, J. Michael, F. Hill, O. Levy, S. R. Bowman, SuperGLUE: A stickier benchmark for general-purpose language understanding systems, in: Advances in Neural Information Processing Systems 32 (NeurIPS 2019), Vancouver, BC, Canada, 2019, pp. 3261–3275. URL: https://proceedings.neurips.cc/paper/2019/hash/4496bf24afe7fab6f046bf4923da8de6-Abstract.html.
[11] J. Deng, W. Dong, R. Socher, L. Li, K. Li, L. Fei-Fei, ImageNet: A large-scale hierarchical image database, in: 2009 IEEE Conference on Computer Vision and Pattern Recognition (CVPR 2009), Miami, Florida, USA, IEEE, 2009, pp. 248–255. doi:10.1109/CVPR.2009.5206848.
[12] J. Renz, X. Ge, M. Stephenson, P. Zhang, AI meets Angry Birds, Nat. Mach. Intell. 1 (2019) 328. doi:10.1038/s42256-019-0072-x.
[13] S. A. Gaggl, T. Linsbichler, M. Maratea, S. Woltran, Design and results of the second international competition on computational models of argumentation, Artif. Intell. 279 (2020). doi:10.1016/j.artint.2019.103193.
[14] The RoboCup Standard Platform League, https://spl.robocup.org/, 1998.
[15] The RoboCup@Home League, https://athome.robocup.org/, 2006.
[16] V. Panayotov, G. Chen, D. Povey, S. Khudanpur, Librispeech: An ASR corpus based on public domain audio books, in: 2015 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP 2015), South Brisbane, Queensland, Australia, IEEE, 2015, pp. 5206–5210. doi:10.1109/ICASSP.2015.7178964.
[17] D. Perez-Liebana, S. M. Lucas, R. D. Gaina, J. Togelius, A. Khalifa, J. Liu, General video game artificial intelligence, Synthesis Lectures on Games and Computational Intelligence 3 (2019) 1–191. URL: https://gaigresearch.github.io/gvgaibook/.
[18] Y. Bisk, R. Zellers, R. LeBras, J. Gao, Y. Choi, PIQA: Reasoning about physical commonsense in natural language, in: The Thirty-Fourth AAAI Conference on Artificial Intelligence (AAAI 2020), New York, NY, USA, AAAI Press, 2020, pp. 7432–7439. URL: https://ojs.aaai.org/index.php/AAAI/article/view/6239.
[19] N. Froleyks, M. Heule, M. Iser, M. Järvisalo, M. Suda, SAT competition 2020, Artif. Intell. 301 (2021) 103572. doi:10.1016/j.artint.2021.103572.
[20] R. Zellers, Y. Bisk, A. Farhadi, Y. Choi, From recognition to cognition: Visual commonsense reasoning, in: IEEE Conference on Computer Vision and Pattern Recognition (CVPR 2019), Long Beach, CA, USA, 2019, pp. 6720–6731. doi:10.1109/CVPR.2019.00688.
[21] Assembly performance metrics and test methods, https://www.nist.gov/el/intelligent-systems-division-73500/robotic-grasping-and-manipulation-assembly/assembly, 2018.
[22] A. L. Maas, R. E. Daly, P. T. Pham, D. Huang, A. Y. Ng, C. Potts, Learning word vectors for sentiment analysis, in: The 49th Annual Meeting of the Association for Computational Linguistics: Human Language Technologies (ACL 2011), Portland, Oregon, USA, 2011, pp. 142–150. URL: https://aclanthology.org/P11-1015/.
[23] M. Sap, H. Rashkin, D. Chen, R. LeBras, Y. Choi, SocialIQA: Commonsense reasoning about social interactions, CoRR abs/1904.09728 (2019). arXiv:1904.09728.
[24] M. R. Genesereth, N. Love, B. Pell, General game playing: Overview of the AAAI competition, AI Mag. 26 (2005) 62–72. doi:10.1609/aimag.v26i2.1813.
[25] P. Rajpurkar, J. Zhang, K. Lopyrev, P. Liang, SQuAD: 100,000+ questions for machine comprehension of text, in: Proceedings of the 2016 Conference on Empirical Methods in Natural Language Processing (EMNLP 2016), Austin, Texas, USA, 2016, pp. 2383–2392. doi:10.18653/v1/d16-1264.
[26] Y. Yang, W. Yih, C. Meek, WikiQA: A challenge dataset for open-domain question answering, in: Proceedings of the 2015 Conference on Empirical Methods in Natural Language Processing (EMNLP 2015), Lisbon, Portugal, 2015, pp. 2013–2018. doi:10.18653/v1/d15-1237.
[27] R. Zellers, Y. Bisk, R. Schwartz, Y. Choi, SWAG: A large-scale adversarial dataset for grounded commonsense inference, in: Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing (EMNLP 2018), Brussels, Belgium, 2018, pp. 93–104. doi:10.18653/v1/d18-1009.
[28] A. Marot, B. Donnot, G. Dulac-Arnold, A. Kelly, A. O'Sullivan, J. Viebahn, M. Awad, I. Guyon, P. Panciatici, C. Romero, Learning to run a power network challenge: a retrospective analysis, in: NeurIPS 2020 Competition and Demonstration Track, volume 133 of Proceedings of Machine Learning Research, PMLR, 2020, pp. 112–132. URL: http://proceedings.mlr.press/v133/marot21a.html.
[29] L. Yang, SDKD: Saliency detection with knowledge distillation, https://lifelong-robotic-vision.github.io/competition/papers/PekingU_linyang.pdf, 2019.
[30] C.-C. Hsu, B. A. Sandford, The Delphi technique: making sense of consensus, Practical Assessment, Research, and Evaluation 12 (2007) 10.
[31] J. Devlin, M.-W. Chang, K. Lee, K. Toutanova, BERT: Pre-training of deep bidirectional transformers for language understanding, arXiv preprint arXiv:1810.04805 (2018).
[32] T. B. Brown, B. Mann, N. Ryder, M. Subbiah, J. Kaplan, P. Dhariwal, A. Neelakantan, P. Shyam, G. Sastry, A. Askell, et al., Language models are few-shot learners, in: Advances in Neural Information Processing Systems, volume 33, 2020, pp. 1877–1901.
[33] D. Hendrycks, S. Basart, S. Kadavath, M. Mazeika, A. Arora, E. Guo, C. Burns, S. Puranik, H. He, D. Song, J. Steinhardt, Measuring coding challenge competence with APPS, 2021. arXiv:2105.09938.
[34] R. Bommasani, et al., On the opportunities and risks of foundation models, arXiv preprint arXiv:2108.07258 (2021).
[35] F. Martinez-Plumed, P. Barredo, S. O. Heigeartaigh, J. Hernandez-Orallo, Research community dynamics behind popular AI benchmarks, Nature Machine Intelligence 3 (2021) 581–589.
[36] A. Barbosa-Silva, S. Ott, K. Blagec, J. Brauner, M. Samwald, Mapping global dynamics of benchmark creation and saturation in artificial intelligence, arXiv preprint arXiv:2203.04592 (2022).