1. Introduction

Anthony G Cohn (3), a.g.cohn@leeds.ac.uk
José Hernández-Orallo (4)
Julius Sechang Mboli (1), mboli4god@gmail.com
Yael Moros-Daval (4)
Zhiliang Xiang (2), xiangz6@cardiff.ac.uk
Lexin Zhou (4), lzhou@inf.upv.es

1 Faculty of Engineering and Informatics, University of Bradford, UK
2 IROHMS, School of Computer Science and Informatics, Cardiff University, UK
3 School of Computing, University of Leeds, UK; and the Turing Institute, UK
4 VRAIN, Universitat Politècnica de València, Spain

Keywords: Evaluation Instruments, Comparison of Evaluation Instruments, Categorisation of Evaluation Instruments, Artificial Intelli-

The current and future capabilities of Artificial Intelligence (AI) are typically assessed with an ever-increasing number of benchmarks, competitions, tests and evaluation standards, which are meant to work as AI evaluation instruments (EI). These EIs are not only increasing in number, but also in complexity and diversity, making it hard to understand this evaluation landscape in a meaningful way. In this paper we present an approach for categorising EIs using a set of 18 facets, accompanied by a rubric to allow anyone to apply the framework to any existing or new EI. We apply the rubric to 23 EIs in different domains through a team of raters, and analyse how consistent the rubric is and how well it works to distinguish between EIs and map the evaluation landscape in AI.

1. Introduction

Ever since researchers started building AI systems, they have wanted to evaluate them, against human benchmarks (such as playing human experts at chess or other games), against other AI systems, or both. Finding good benchmarks for evaluating systems, and conducting tests, is harder than it might seem, particularly since we believe we have good methods for evaluating human intelligence, via standard tests and examinations.

There have been many tests proposed for evaluating AI systems. Probably the most famous of these is the Turing Test [1]. There have been various Turing Test competitions, of which the best known is the annual Loebner Prize competition; the results have sometimes been entertaining, and a way of promulgating ideas about AI to the general public, but it is hard to argue that [...] can be found in [5].

Author homepages: https://jsmboli.github.io/jsmboli/ (J. S. Mboli); https://zl-xiang.github.io/ (Z. Xiang); https://lexzhou.github.io/ (L. Zhou)

ORCID: 0000-0002-7652-8907 (A. G. Cohn); 0000-0001-9746-7632 (J. Hernández-Orallo); 0000-0003-1708-3052 (J. S. Mboli)

The situation today is that there are thousands of challenges in almost all areas of AI. They are increasing in complexity and diversity as AI techniques evolve. Because of this, it is hard to analyse this evaluation landscape in a meaningful way. Motivated by this need, we present and discuss an approach to categorising benchmarks, competitions, tests and evaluation standards, jointly referred to as AI evaluation instruments (EI). We do this categorisation via a set of 18 facets, which we believe will be valuable in distinguishing and evaluating different proposals for evaluating AI systems. These facets, and an accompanying rubric to facilitate choosing appropriate values, are described in section 2.

We will classify EIs using the facets in order to (a) evaluate how well the facets work in general and (b) to what extent they help mapping the landscape of EIs and distinguish their differences. This may help inform how much we can translate from the facet values to guide the design of future EIs. We do not imagine there can be a single universal evaluation instrument, or even a battery for each domain (vision, reasoning, etc.); certainly that ideal has eluded the community so far. We do not even aspire to find facet values that are valid for all EIs but our proposed work may help in directing future efforts in the evaluation of AI systems.

Since it is infeasible in a reasonable amount of time to apply this categorisation to the thousands of EIs in the literature, here we cover 23 EIs (see Table 2). By evaluating a reasonable number of carefully chosen examples, we hope to give a fair picture of the extent to which the aspects of AI appraised by the facets are being tested in the selected examples. Beyond the insights that we extract from this selected set of EIs, this paper and the rubric we have developed for the different facets should serve as a reference for third parties (e.g., other researchers) to analyse other EIs.

The rest of the paper is organised as follows. Section 2 presents the 18 facets and a rubric which explains how facet values should be chosen. Next, in section 3, we discuss the criteria for selecting the 23 EIs and the methodology the raters used to apply the rubric. Section 4 discusses the level of disagreement between raters for each facet and EI, and how the methodology and the number of raters was adapted based on these observations. Section 5 analyses the ratings of the 23 EIs, and what they reveal about this group of EIs. Finally, section 6 closes with some general discussion and possible future work.

2. Characterising AI Evaluation Instruments

We looked for existing features or dimensions to characterise EIs, but unfortunately we did not find any systematic account in AI, other than concepts such as reproducibility, realism, coverage and specificity, usually referred to with other names and applied to a single EI. We found more dimensions and a more systematic coverage of evaluation instruments in the area of psychological testing. As a result, we have introduced a new set of facets, but when possible, the terminology is based on the common use in AI, but also incorporating terms and concepts from the Standards for Educational and Psychological Testing by the American Educational Research Association [6].

The following list¹ proposes 18 facets to characterise existing and future EIs for AI. Each facet is followed by the options in brackets. Some options indicate ‘(specify)’, which means that the rater must indicate a (freetext) value for that option. The full description of the facets usually includes some examples and further clarifications². Here we only include the basic definition of each of them. We use colours (blue and black) that are indicative, with blue referring to the preferred or most challenging case, in general. However, for some facets a blue value may make no sense, or we do not believe that one value is ‘better’ than any other, so these facets have no coloured facet value(s).

• Vp - Purpose [RESEARCH, CONFORMITY, OTHER (specify)]: Is the benchmark meant to foster research or development, or to certify whether an AI system conforms with some level or standard?
• Vc - Capability [TASK-PERFORMANCE (specify), CAPABILITY (specify)]: Does the EI just measure observed (aggregated) performance on a TASK (e.g., protein folding, credit scoring) or is the EI designed to also measure a CAPABILITY (e.g., object permanence, dealing with negation)?
• Vf - Reference [ABSOLUTE, RELATIVE (specify)]: Are results reported as an absolute metric (criterion-referenced) or are they reported as a relative (percentage) metric to a reference (norm referenced), e.g., human performance?
• Vo - Coverage [BIASED (specify), REPRESENTATIVE]: Does the EI cover a BIASED or unbiased (REPRESENTATIVE) distribution of what is meant to be measured?
• Vs - Specificity [SPECIFIC, CONTAMINATED]: Are the results precisely aligned with what is meant to be measured or contaminated by other skills or tasks?
• Vl - Realism [TOY, GAMIFIED, REALISTIC, REAL-LIFE]: To what extent is the EI a toy problem, a complex gamified problem, is it a realistic setting (e.g., but still in a simulated scenario, a lab or testing facility) or is the evaluation itself happening in real life³?
• Cj - Judgeability [MANUAL, AUTOMATED, MIXED]: Is scoring manual (e.g., through human questionnaires or judges) or automated (e.g., correct answers or optimality function) or a mixture?
• Cc - Containedness [FULLY-CONTAINED, PARTIAL INTERFERENCE (specify), NOT-CONTAINED (specify)]: Once started, is the testing isolated from external factors or interference possibly having an effect on the results (human participants, online data, weather, etc.), or is there some partial interference not affecting the results significantly or is it dependent on external resources and conditions?
• Cp - Reproducibility [NON-REPRODUCIBLE, STOCHASTIC, EXACT]: Is the evaluation non-reproducible, with results biased or spoiled if repeated; does the EI have stochastic components leading to different interactions; or are the results completely reproducible, i.e. can exactly the same test (inputs, interaction, etc.) be generated again for another (or the same) competitor?
• Cl - Reliability [RELIABLE, NON-RELIABLE, N/A]: Does the evaluation present sufficient repetitions, episode length or number of instances to give low variance for the same subject when applied again (test-retest reliability)? If the testing methodology or the common use of the EI is not clear then N/A may be the most appropriate facet value.
• Cv - Variation [FIXED, ALTERED, PROCEDURAL]: Is the evaluation based on fixed datasets; have the instances been altered by adding post-processing variations (noise, rotations, etc.); or have they been created (e.g., using procedural generation⁴)?
• Ca - Adjustability [UNSTRUCTURED, ABLATABLE, ADAPTIVE]: Is the analysis of results on the set of instances unstructured; or has the EI identified a set of meta-features such as difficulty or dimension that could be used to analyse the results by these dimensions (ablatable); or are these meta-features used to adaptively or adversarially choose the instances to test more informatively (adaptive)?
• Fn - Antecedents [CREATED, RETROFITTED (specify)]: Is it devised on purpose for AI or adapted from tests designed to test humans?
• Fm - Ambition [SHORT, LONG]: When the EI was created, was it aiming at the short term (improving on the SOTA) or long term (more ambitious goals)?
• Fp - Partiality [PARTIAL (specify), IMPARTIAL]: Does the EI favour particular technologies, conditions or cultures that should not have an influence on the result of the evaluation⁵?
• Fo - Objectivity [LOOSE, CUSTOMISED, FULLY-INDEPENDENT]: Is it loosely defined, customised to each participant or does the EI have a predetermined independent specification⁶?
• Fr - Progression [STATIC, DEVELOPMENTAL]: Is the score measuring a capability at one particular moment or is it evaluating the development of the capability of the system within the test?
• Fu - Autonomy [AUTONOMOUS, COUPLED (specify), COMPONENT]: Is it measuring an autonomous system, coupled with other systems (e.g., humans) or as a component?

The facets above can be grouped into three main categories following the three main groups given by the Standards for Educational and Psychological Testing [6]: validity, reliability/precision and fairness. We use these three major groups to give some structure to the facets above. Roughly, these groups deal with what is measured, how it is measured and who is measured, respectively.

• Validity group (Does it measure what we want to measure?): Vp, Vc, Vf, Vo, Vs, Vl
• Consistency (Reliability/Precision) group (Does it measure it effectively and verifiably?): Cj, Cc, Cp, Cl, Cv, Ca
• Fairness group (Does it treat all test takers equally?): Fn, Fm, Fp, Fo, Fr, Fu

Some of these are closely related, such as {Cv,Ca,Vo} or {Fo,Cp}. The term accommodation in [6] is “used to denote changes with which the comparability of scores is retained, and the term modification is used to denote changes that affect the construct measured by the test”. This is related to Vs, Cv, Fo and Cc, and also to the term “measurement invariance”, which is very important here to see if accommodations of the same test could evaluate the same construct for different AI systems and even humans.

¹ Each facet has both a name and a two letter acronym, whose initial letter is V, C or F, the reason for which will become clear later.
² The latest version of the rubric can be found in https://tinyurl.com/mr2bv5hb
³ REAL-LIFE does not mean a final or specific product in operation. It can also happen in very early stages of research, such as evaluating prototype chatbots in a real social network.

3. EI Selection and Rating Methodology

Now that the facets and the rubric have been explained, we proceed to discuss how the EIs were selected, what the final selection was, and what protocol we followed in assigning EIs to the raters.

3.1. EI Selection

We considered evaluation instruments with the following criteria for inclusion:

• Potential interest to understand the future of AI skills: An EI might be regarded as being of interest if systems which perform well on it can be regarded as indicating a noteworthy change in the capabilities of AI in general. In other words, progress in this EI requires significant enhancement of AI techniques beyond the specific requirements of the EI.
• Diversity in the kind of task: We tried to cover a variety of domains, formats and types of problems (vision, natural language, competitions, datasets, supervised, etc).
• Popularity: How many teams have already used this EI? How many published papers refer to it? We can use proxies for this, such as citations to the original papers introducing the EI, or the number of results on websites such as paperswithcode.com. We also have to consider that industry-related EIs may be less popular than research-oriented EIs. However, given the number of EIs selected, we repeat domains and cover just a few areas (e.g., NLP, vision, robotics) without being comprehensive for all possible domains.
• Currency: We prefer EIs still in active use or recently introduced, rather than those which have fallen out of use.

⁴ Although we have coloured PROCEDURAL, we recognise that procedural may not always be better and can lead to problems if variations are not in an appropriate proportion. Also, generated data may just lead to a learning algorithm reverse-engineering the generator.
⁵ Vo-Coverage is about the domain, whilst Fp-Partiality is about how the EI may favour some test-takers over others.
⁶ LOOSE refers to cases when evaluation is very open, e.g., a robotic-domain EI where we evaluate on a satisfactory interaction with the user, but not even a clear questionnaire is defined. FULLY-INDEPENDENT could treat different groups differently if there is a reason for equality of treatment.

Table 1
            Consistently Agreed    Moderately Agreed    Often Diverged
2 options   Fr, Fn                 Vf, Fp               Vc, Vo, Vs, Fm
3 options   Vp, Cj, Cc, Fo, Fu     Cp, Cv               Cl, Ca
4 options                          Vl
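The rows of Table 1 can be cross-checked against the facet list of section 2 with a minimal sketch. It encodes the 18 facets with their options (the free-text "(specify)" fields are omitted) and groups the facets by number of options. The names `FACETS`, `by_option_count` and `invalid_facets` are illustrative assumptions and are not part of the rubric.

```python
from collections import defaultdict

# Facet options transcribed from the facet list in section 2.
# The "(specify)" free-text markers are dropped in this sketch.
FACETS = {
    "Vp": ["RESEARCH", "CONFORMITY", "OTHER"],
    "Vc": ["TASK-PERFORMANCE", "CAPABILITY"],
    "Vf": ["ABSOLUTE", "RELATIVE"],
    "Vo": ["BIASED", "REPRESENTATIVE"],
    "Vs": ["SPECIFIC", "CONTAMINATED"],
    "Vl": ["TOY", "GAMIFIED", "REALISTIC", "REAL-LIFE"],
    "Cj": ["MANUAL", "AUTOMATED", "MIXED"],
    "Cc": ["FULLY-CONTAINED", "PARTIAL INTERFERENCE", "NOT-CONTAINED"],
    "Cp": ["NON-REPRODUCIBLE", "STOCHASTIC", "EXACT"],
    "Cl": ["RELIABLE", "NON-RELIABLE", "N/A"],
    "Cv": ["FIXED", "ALTERED", "PROCEDURAL"],
    "Ca": ["UNSTRUCTURED", "ABLATABLE", "ADAPTIVE"],
    "Fn": ["CREATED", "RETROFITTED"],
    "Fm": ["SHORT", "LONG"],
    "Fp": ["PARTIAL", "IMPARTIAL"],
    "Fo": ["LOOSE", "CUSTOMISED", "FULLY-INDEPENDENT"],
    "Fr": ["STATIC", "DEVELOPMENTAL"],
    "Fu": ["AUTONOMOUS", "COUPLED", "COMPONENT"],
}

# The first letter of each acronym names the group of the facet.
GROUPS = {"V": "Validity", "C": "Consistency", "F": "Fairness"}


def by_option_count(facets):
    """Group facet acronyms by their number of options (the rows of Table 1)."""
    rows = defaultdict(list)
    for name, options in facets.items():
        rows[len(options)].append(name)
    return dict(rows)


def invalid_facets(rating):
    """Return the facets whose value in `rating` is missing or not a listed option."""
    return [f for f, options in FACETS.items() if rating.get(f) not in options]
```

A rating for one EI is then a plain dictionary from facet acronym to chosen option, and `invalid_facets` flags entries that fall outside the rubric before any agreement analysis is run.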

The source of the EIs was mostly repositories⁷ and surveys, institutions such as NIST⁸ and LNE⁹, and competitions at AI conferences. Then, we identified possible gaps in terms of domains, or cases where we expected the answers for some facets to be too similar. We also considered whether we would expect to get diversity in the values in blue for the facets, so that we get different levels of quality according to this colour code. Note that at the time of selection we could of course only roughly estimate how many blue categories we might get for each EI. Since we expected to learn more about the categorising of EIs as categorisation proceeded, we did not choose all EIs in advance but selected them incrementally. The 23 selected EIs are shown in Table 2.

These EIs cover a good distribution of benchmarks, competitions and datasets, although some of them can be considered to be in two of these categories. The term ‘test’ to refer to an EI is less usual. About half of the 23 EIs require the use of language in the inputs and/or outputs, and about one half of them require some kind of perception (mostly computer vision), with some overlap in these two groups. Only a few of the EIs are related to navigation and robotics, in virtual (e.g., video games) or physical environments, and a small number are related to more abstract capabilities or problems related to planning or optimisation.

⁷ http://paperswithcode.com, http://kaggle.com, https://zenodo.org/record/4647824#.YV7CPdrMKUk, https://www.eff.org/ai/metrics, https://en.wikipedia.org/wiki/List_of_datasets_for_machine-learning_research, http://www.chalearn.org.

⁸ https://www.nist.gov/programs-projects/ai-measurement-and-evaluation
⁹ https://www.lne.fr/en/testing/evaluation-artificial-intelligence-systems

3.2. Rating Methodology

We devised a protocol to refine and validate the rubric, but also to cover as many EIs as possible, according to the number of raters we had available. We explain the protocol below, but we note that this protocol can be adapted to other situations or can incorporate ideas from consensus-based ratings or the Delphi method [30]. First, two of the authors of this paper (A.C. and J.H-O.) acted as coordinators for the rating process. A total of four raters were chosen. Raters were AI-related undergraduate and graduate students, and were recruited through a selection process and interviews. They are the other four authors of this paper (J-S.M., Y. M-D, Z.X. and L.Z.). Once the raters were appointed, each rater was given some meta-information about each EI (acronym, name, major sources, what it measures, etc.) and had to complete some other general information about each EI. They were also asked to report some information about their own completion, such as time taken (in hours).

We established three batches, covering 2, 11 and 10 EIs respectively, in the order they are presented in Table 2. The first two EIs had already been used by the coordinators in developing the list of facets and their values. All the subsequent raters started off on these two EIs too and were given feedback on their chosen values before proceeding to any further EIs. We refer to these two EIs as “Batch 1”. The next 11 EIs are referred to as “Batch 2”. These two batches were done by all four raters, independently. After the analysis of consistency we deemed it sufficient to have only two raters per EI. Then, a final set of 10 EIs, referred to as “Batch 3”, were each rated by just two raters, for reasons of economy, since we already had reasonable inter-rater consistency after the end of batches 1 and 2. The two raters for each EI were assigned so that all raters would have five EIs, and across their five EIs, they co-rated with all the other three raters (i.e., one EI with one other rater and two EIs with each of the other raters). In this first stage, they worked independently, not sharing values for any of the facets, and only reporting questions and partial results to the coordinators.
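The Batch 3 constraint (ten EIs, two raters each, five EIs per rater, and each rater sharing one EI with one colleague and two EIs with each of the other two) can be illustrated with a small sketch. The rater labels, the EI placeholder names and the particular choice of which pairs share a single EI are assumptions made for illustration; the paper does not state the actual assignment.

```python
from collections import Counter
from itertools import combinations

RATERS = ["R1", "R2", "R3", "R4"]            # hypothetical rater labels
EIS = [f"EI-{i}" for i in range(1, 11)]      # placeholders for the 10 Batch 3 EIs


def batch3_pairs(raters):
    """Build 10 rater pairs: two disjoint pairs get one EI, the other four pairs get two."""
    a, b, c, d = raters
    single = [(a, b), (c, d)]                # a perfect matching: each rater appears once
    double = [p for p in combinations(raters, 2) if p not in single]
    return single + double * 2               # 2 + 4 * 2 = 10 pairs in total


pairs = batch3_pairs(RATERS)
assignment = dict(zip(EIS, pairs))           # EI name -> pair of raters
load = Counter(r for pair in pairs for r in pair)

for ei, (first, second) in assignment.items():
    print(ei, first, second)
```

Because the two single-EI pairs form a perfect matching over the four raters, every rater ends up with 1 + 2 + 2 = 5 EIs, which is the balance described above.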

There were some changes of the rubric between batches, especially clarifying the description of some of the facets, and in a few cases, changes in the number and/or name of the options. Whenever a change was introduced, the raters were informed and had to revisit their ratings for previous batches.

In a second and final stage of the process, the coordinators allowed the raters to exchange opinions, but they were not asked to reach a consensus, just to identify possible misunderstandings. From this discussion, a few ratings were modified. Unless explicitly stated, we refer to these final ratings in the rest of the paper.

Table 2: The selected EIs, in table order (columns: EI, Type, Domain, Aim, Year).

• [...]: It was specifically targeted to evaluate common sense reasoning, as an alternative to the Turing test, arguing conceptual and practical advantages.
• [...]: The original goal was to evaluate “general, domain-independent AI technology”, by using a diversity of video games, although what it measures more specifically is unclear.
• [...]: The goal of GLUE and superGLUE (an improvement/modified version of GLUE) is to measure the performance (e.g. accuracy, F1-score) of an AI system in natural language understanding tasks (Single-Sentence Tasks, Similarity and Paraphrase Tasks, and Inference Tasks) in English.
• [...]: Aims to measure the visual recognition capability for object recognition, image classification, and object localisation. The images can contain different numbers of objects (e.g. mammal, bird, fish, vehicle, furniture, tool, flower, fruit, etc.), occlusions, and clutters (i.e. diversity and noise).
• [...]: Measures the planning capability of an agent in a large action space, without knowing the physical parameters of objects, in the situation given by Angry Birds.
• [...]: Aims to measure/compare the performance of different solvers regarding argumentation (particularly, reasoning problems that require logic).
• [...]: The aim is to measure and promote improvements in multi-robot (humanoid) systems by playing soccer matches with robots.
• [...]: Aims to measure the performance of the developed AI robots in providing service with assistive robot technology with high relevance for future personal domestic applications.
• LibrispeechSL12 [16]: Aims to provide a freely available read speech corpus in English that is suitable for training and testing speech recognition systems.
• GVGAI [17]: Aimed at systems that can perform well in multiple video games, possibly without knowing the game in advance and with little to no specific domain knowledge, as an approximation to artificial general intelligence.
• PIQA [18]: Aims to measure physical interaction reasoning about both the prototypical use of objects (e.g., shoes are used for walking) and non-prototypical but practically plausible use of objects (e.g., shoes can be used as a doorstop). It targets language representations of knowledge traditionally only seen or experienced.
• SAT [19]: Aims to keep progress and further improve the performance and robustness of SAT solvers, with a history dating back to the early 90s, thanks to the persistent efforts of the SAT community.
• VCR [20]: It aims to measure the ability to infer what is happening in a picture (people’s actions, goals, etc.) from visual signs which are obvious for humans.
• Assembly [21]: Identifying key competencies and characteristics of robotic systems using a robust set of formalized evaluations and benchmarks. To help to match robotic hand capabilities to end-user needs as well as to help provide developers and researchers insight for improving their hardware and software designs.
• IMDb [22]: Detecting the sentiment of a piece of text.
• SocialIQA [23]: Aimed to measure the social and emotional intelligence of computational models through multiple choice question answering.
• GGP [24]: General game playing (GGP) is the design of artificial intelligence programs to be able to play more than one game successfully.
• SQUAD2.0 [25]: It aims to measure reading comprehension abilities that allow a system to get a correct answer to a given question when the solution can be extracted from the text, or abstain from answering otherwise.
• WikiQA [26]: WIKIQA is a dataset for open-domain question answering.
• SWAG [27] (Domain: NLI, CR; Year: 2018): Aims to evaluate the performance of a system in grounded commonsense inference (reasoning about a situation and anticipating what might come next) by answering multiple choice questions.
• L2RPN [28] (Type: competition; Domain: SG, AI, PG, PN; Year: 2012): This challenge aims at testing the potential of AI to address this important real-world problem for our future.
• Lifelong-Robots [29] (Type: competition; Domain: robotics, CV, RV; Year: 2019): Provides a robotic vision dataset collected from real time environments to accelerate both research and applications of visual models for robotics.

Year column for the entries before SWAG, in table order: 2016, 2013, 2018, 2019, 2010, 2010, 2015, 1998, 2006, 2015, 2014, 2019, 2002, 2019, 2017, 2011, 2019, 2005, 2018, 2014.
Type column for the entries before L2RPN, in table order (incomplete): competition, competition, competition, competition, competition, dataset, benchmark, dataset, competition, dataset, competition, dataset, benchmark, competition, dataset, benchmark, dataset, "dataset, benchmark".
Domain column for the entries before SWAG (incomplete): LU, CS, reasoning; VG, general AI, PN; [...]

Abbreviations: HRIC = Human-Robot-Interaction and Cooperation; NMDE = Navigation and Mapping in dynamic environments; CV = Computer Vision; ABP = Adaptive Behaviors, planning; AA = abstract argumentation; CL = computational logic; VG = video games; KRRP = knowledge representation, reasoning, planning; RCRPVMASS = robotics, cooperation, real-time planning, vision, multiagent systems, strategy; LU = Language understanding; CS = common sense; RM = Robotics in Manufacturing; ARH = Adaptive Robot hands; MPLT = Manipulation planning based on learning techniques; DiHM = Dexterous in-hand manipulation; RGVELO = Robust grasping with various everyday life objects; SI = social interaction; SIn = social intelligence; EI = emotional intelligence; IR = inferential reasoning; CR = commonsense reasoning; VR = visual recognition; PN = planning and navigation; SG = Smart Grids; PG = Power Grids; PN = Power networks; PCU = physical commonsense understanding; NLI = natural language inference; RV = Robotic vision.

4. Analysis of Rater Consistency

As noted above, the 1st and 2nd batches differ from batch 3 because the former had four raters whilst the latter only two. Thus, in the former case, a majority agreement can be formed with three or four raters agreeing, whilst in batch 3 only when both raters agree; hence ‘majority’ is less statistically significant for the 3rd batch. For simplicity, we will use round A and round B respectively when referring to the first two batches and the 3rd batch. As shown in Figure 1, the level of agreement coincides to a great extent when comparing the results from all batches (Figure 1, top) with the individual ones from round A (Figure 1, middle) and round B (Figure 1, bottom). It can be expected that those facets with more possible values (4) might have more disagreements than those with only two possible values, simply for statistical reasons. We can see that in fact this is not having a big effect, as shown in Table 1.

[Figure 1 panels: “Facets Agreements for Both Rounds” (top), “Facets Agreements for the Round A” (middle), “Facets Agreements for Round B” (bottom). Horizontal axis: facets Vp, Vc, Vf, Vo, Vs, Vl, Cj, Cc, Cp, Cl, Cv, Ca, Fn, Fm, Fp, Fo, Fr, Fu; vertical axis: count.]

Figure 1: Agreements on facet value ratings for the 23 EIs and rounds A and B.

The pattern of agreement or disagreement amongst the raters tends to vary depending on several factors such as facet complexity, available information on the EI, and so on. In particular, we observe the following:

• Fr, Fn, Vp, Cj, Cc, Fo, Fu are consistently agreed across all batches, with very few disagreements.
• Vf, Vl, Cp, Cv, Fp appear to be moderately agreed and supported by a majority (≥ 75%). Notably, Vl has the largest number of value options, but is still agreed well by a majority.
• Vo and Vs, although they have only binary options, are two of the least agreed facets, as is Cl.

It is not surprising that some of the facets consistently reached consensus considering the facet values tend to distribute towards one single selection (detailed in Section 5). For instance, as we will see in the following section, RESEARCH is picked for the Vp facet with only one disagreement for all rounds. This might reflect the fact that some EIs do not have much variability in their options. For example, most EIs are indeed proposed for the purpose of research (Vp), and given the low variability in the values there cannot be much disagreement (the variance of a Bernoulli distribution). As the variability of facets increases, choosing answers for the facets might require more EI-specific domain knowledge from the raters. For instance, to make justifiable decisions for facets like Vo and Vs, raters often need to seek related literature for support when the answers were not clear from the specifications of EIs. Whether an EI is specific (Vs) and general (Vo) enough for the measuring of certain capabilities is indeed hard to judge depending solely on [...] explanation for inconsistent selections in Vc, Ca and Fm since they allow raters more space for subjective interpretations. [...]ties regarding NLU (long-term); object recognition could be argued as a visual capability or a specific task. Having both option variability and subjectivity made the three [...] For instance, when TASK-PERFORMANCE is selected for Vc, the value of facet Vs is more likely to be SPECIFIC. As such, Vs is more likely to diverge if disagreement occurred on Vc. This might also account for the high diverging rate of facets in the Validity group.
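The notion of majority agreement used in this section can be made concrete with a short sketch. It treats the ratings of one facet for one EI as a list of values, one per rater, and applies the 75% threshold mentioned above, which means three or four of four raters in round A and both raters in round B. The function names are illustrative, and `pairwise_disagreement` rests on the simplifying assumption that two raters choose independently, each picking the dominant value of a binary facet with probability p.

```python
from collections import Counter


def agreement_share(ratings):
    """Fraction of raters that chose the most common value for one EI and one facet."""
    counts = Counter(ratings)
    return counts.most_common(1)[0][1] / len(ratings)


def has_majority(ratings, threshold=0.75):
    """True if the most common value reaches the threshold share of raters."""
    return agreement_share(ratings) >= threshold


def pairwise_disagreement(p):
    """Chance that two independent raters differ on a binary facet (assumption:
    each picks the dominant value with probability p). This is twice the
    Bernoulli variance p * (1 - p)."""
    return 2 * p * (1 - p)


# Round A style rating (four raters) and round B style rating (two raters).
print(has_majority(["RESEARCH", "RESEARCH", "RESEARCH", "CONFORMITY"]))
print(has_majority(["SPECIFIC", "CONTAMINATED"]))
```

Under this assumption a facet whose values concentrate on one option (p close to 1) leaves little room for disagreement, while a facet split evenly between two options (p = 0.5) maximises it, which is the statistical effect referred to above.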

In summary, apart from the statistical reason given by the number of values and their variability, the causes for disagreement can be grouped into three blocks:

• Similarity between facet values: The closeness or similarity between facet options might have also reduced the chance of picking the right option. For example, the facet Vl - Realism has four options (TOY, GAMIFIED, REALISTIC and REAL-LIFE), and it is not always easy to distinguish between REALISTIC and REAL-LIFE.
• Insufficient Details: For many EIs, the information or details provided by the organisers of the competition, the test or the datasets in the EI is not sufficient to understand what the EI is actually measuring. Other EIs are well documented and have published articles that make it easy to obtain meta-information and the facet values for such EIs.
• Conflicting Information: One of the factors that did not help is the source of information about each EI. For some EIs, there is perhaps too much information and many papers using them, and they do not always understand the same thing or use it in the same way. One paper or website might be talking about task performance while other sources talk of capabilities or both.

Overall, given these sources and level of disagreement, as shown in Figure 1, we considered the rubric sufficiently validated to move from round A to round B with fewer raters, and for the analysis in the next section.

5. Analysis of Results

[...]ments. Surprisingly, only around half of EIs were SPECIFIC (Vs), i.e., another half were CONTAMINATED. All the EIs that were designed for TASK-PERFORMANCE are always SPECIFIC (this is suggested in the rubric) but more interestingly, most EIs designed to measure CAPABILITY are CONTAMINATED (i.e., the results do not completely align with what is meant to be measured). More effort is needed to encourage reliable and robust methodologies to evaluate the capability of the AI systems, although we recognise sometimes it is inevitably hard to measure reliably certain capabilities (e.g., common-sense reasoning). With regard to realism (Vl), REALISTIC EIs account for a predominant proportion (circa 80%), implying considerable focus on measuring systems solving practical problems, but the evaluation is not in an actual real-life scenario; thus most EIs focus on evaluating the systems in simulated scenarios or scenarios which are an abstraction of a real-world setting.

Consistency group (Does it measure it effectively and verifiably?): Nearly all EIs are FULLY-CONTAINED (Cc), implying that current EIs enjoy high independence from external factors during the assessment, and RELIABLE (Cl), which are desirable features. Regarding Cj, most EIs evaluate the systems with AUTOMATED scoring instead of MANUAL or MIXED. This phenomenon can be double-edged since automated scoring is generally more objective and faster to calculate but also requires a proper definition for the scoring. For instance, how do we use automated scoring to evaluate whether a robotic dancer or cook is good or bad? This may be easy for some human experts but quite hard to define using a metric. Things become particularly complicated when measuring a special capability, such as common-sense reasoning.

In terms of Cv, nearly all EIs are FIXED datasets. Almost none had altered the instances by adding post-processing variations or created new instances to cover a range of variations intrinsically, possibly because using fixed datasets is easier than modifying instances systematically. However, this could limit the diversity of the evaluation methodology (e.g., sometimes it would be interesting to see how the system's performance varies when noise is added to the data, to test the model's robustness). Surprisingly, most EIs are UNSTRUCTURED or ABLATABLE (Ca), but almost none are ADAPTIVE. This might be because adaptive tests are much more difficult to operate and require an understanding of what the most informative instances are.
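As an illustration of the post-processing variation mentioned above, the following minimal sketch scores a system on a fixed set of instances perturbed with increasing amounts of Gaussian noise. It is not part of the rubric or of any of the analysed EIs: the model `toy_model`, the function `accuracy_under_noise` and the data values are hypothetical and serve only to show the idea of reporting a performance profile rather than a single score.

```python
import random


def toy_model(x):
    """Hypothetical system under test: predicts 1 for positive inputs, else 0."""
    return 1 if x > 0 else 0


def accuracy_under_noise(instances, labels, noise_std, seed=0):
    """Score the model on a copy of the fixed instances perturbed with Gaussian noise."""
    rng = random.Random(seed)  # fixed seed so the perturbation is reproducible
    correct = 0
    for x, y in zip(instances, labels):
        noisy_x = x + rng.gauss(0.0, noise_std)
        correct += int(toy_model(noisy_x) == y)
    return correct / len(instances)


# Fixed dataset (illustrative values only, not taken from any EI).
instances = [-2.0, -1.0, -0.5, 0.5, 1.0, 2.0]
labels = [0, 0, 0, 1, 1, 1]

# One score per noise level instead of a single score on the fixed set.
profile = {
    std: accuracy_under_noise(instances, labels, std)
    for std in (0.0, 0.5, 1.0, 2.0)
}
for std, acc in profile.items():
    print(f"noise_std={std}: accuracy={acc:.2f}")
```

With a noise level of zero the sketch reproduces the score on the fixed dataset, so the usual single-number result is one point of the profile.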

Fairness group (Does it treat all test takers equally?): EIs that are IMPARTIAL account for 80% of the data (Fp), which seems a good indicator. However, the actual value might be lower, since partiality is often hard to detect. For instance, in an EI for benchmarking clinical decision support systems, the training set may only include Latin American patients while the test set includes patients from other regions. Interestingly, virtually all the analysed EIs are classified as FULLY INDEPENDENT (Fo), as values CUSTOMISED and LOOSE are only 0.25 (i.e., these options were only chosen once).

The fact that current EIs have the same predetermined specification for all assessed systems is positive and a characteristic that favours fairness in evaluation. Nearly all EIs evaluate the AI systems statically rather than developmentally, possibly because for many applications we care more about the final performance than about how the system's performance evolves. Also, it is easier to evaluate the former than the latter. However, DEVELOPMENTAL EIs could give more insight into how the models learn with variations of the input features and different curricula, detect when and why things go wrong during the training phase, and reveal the trade-off between number of instances, time and performance.
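To make the contrast between static and developmental evaluation concrete, the sketch below scores a learner at several points during training instead of only at the end. Everything in it is an assumption made for illustration: the threshold learner `fit_threshold`, the function `evaluate_developmentally`, the checkpoints and the data are hypothetical and do not come from any of the analysed EIs.

```python
def fit_threshold(seen):
    """Toy learner: threshold halfway between the class means seen so far."""
    pos = [x for x, y in seen if y == 1]
    neg = [x for x, y in seen if y == 0]
    if not pos or not neg:
        # Only one class (or nothing) seen so far: always predict that class.
        only = 1 if pos else 0
        return lambda x: only
    threshold = (sum(pos) / len(pos) + sum(neg) / len(neg)) / 2
    return lambda x: 1 if x > threshold else 0


def evaluate_developmentally(train, test, checkpoints):
    """Return (instances seen, test accuracy) at each training checkpoint."""
    curve = []
    for n in checkpoints:
        predict = fit_threshold(train[:n])
        correct = sum(int(predict(x) == y) for x, y in test)
        curve.append((n, correct / len(test)))
    return curve


# Illustrative data only: (input, label) pairs.
train = [(3.0, 1), (2.5, 1), (-1.0, 0), (-2.0, 0), (1.0, 1), (-0.5, 0)]
test = [(2.0, 1), (0.8, 1), (-0.7, 0), (-3.0, 0)]

curve = evaluate_developmentally(train, test, checkpoints=(1, 2, 4, 6))
for n, acc in curve:
    print(f"after {n} training instances: accuracy={acc:.2f}")
```

A static evaluation would report only the last point of `curve`; the developmental view also shows how many instances were needed before performance improved.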

In summary, regarding the validity group, we found that most of the selected EIs that measure a capability do not necessarily measure it reliably. Still, these failures could serve as excellent future references for developing more robust frameworks for evaluating capabilities, and more effort is required in the years to come. Also, we still need to improve the coverage (i.e., representativeness) of the current EIs. In addition, the development of more EIs with real-life settings may encourage the development of AI systems better able to operate in real-life situations.

Figure 2: The distribution of the options in all facets.

Regarding the consistency group: although most of the selected EIs measure effectively and verifiably, as they are FULLY-CONTAINED and RELIABLE, there is still an evident lack of diversity in the evaluation process. For instance, we may need more EIs focusing on altering instances by adding post-processing variations or creating instances to cover a range of variations intrinsically. Also, more adaptive ways to test a system should be encouraged, in order to evaluate how the system copes in circumstances with different difficulties. Finally, in terms of fairness, the selected EIs enjoy low partiality and high objectivity. However, more effort is needed in spurring EIs to also focus on evaluating how a system performs during the development process. Furthermore, the community may need more benchmarks that focus on humans and machines working together, since only one out of 23 EIs was done this way.

When looking at the distribution of facet values per EI, we can see that those related to robotics and the physical world (Robocup SPL, Robocup@Home and lifelong-robots) have more variability in judgeability (MANUAL becomes more frequent), realism (REALISTIC and REAL-LIFE also become more frequent) and containedness (PARTIAL-INTERFERENCE becoming more common), as well as autonomy, with the COUPLED value being chosen in some of them. One of the most popular EIs in the history of AI, ImageNet, is the only one where the value PARTIAL is chosen by (at least) half of the raters, and also the one with all BIASED values chosen in coverage (along with LibriSpeech). The disagreement in partiality may suggest that some sources of partiality are only discovered after the repeated use of an EI and not identified by everyone immediately. GVGAI is peculiar as a well-thought-out EI, where video games are ablatable by several characteristics or difficulty of the game. This is also going in the direction of being procedural, but still to a limited extent as per the values assigned by the raters for this EI. Finally, those EIs related to natural language, and especially WSC, GLUE, SUPERGLUE, Physical IQa, SocialQA, SQUAD2.0, WikiQA and SWAG, have high degrees of CONTAMINATED values in facet SPECIFICITY. This might be a reflection of how difficult it is to isolate particular capabilities when using natural language, as some basic natural language competency requires many other things. And this is reflected by the success of language models recently doing a variety of tasks [31, 32, 33, 34], since mastering natural language seems to be contaminated by so many other capabilities and skills.
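The per-EI distributions of facet values discussed in this section can be tallied from the raters' choices in a few lines. The sketch below is only an assumption about how such a tally might be organised: the record layout, the function names `facet_distributions` and `chosen_by_at_least_half`, and the ratings themselves are hypothetical and are not data from the study.

```python
from collections import Counter, defaultdict

# Hypothetical ratings: (EI, facet, rater, chosen value). Not data from the study.
ratings = [
    ("EI-A", "Specificity", "r1", "SPECIFIC"),
    ("EI-A", "Specificity", "r2", "CONTAMINATED"),
    ("EI-A", "Specificity", "r3", "CONTAMINATED"),
    ("EI-A", "Specificity", "r4", "CONTAMINATED"),
    ("EI-B", "Specificity", "r1", "SPECIFIC"),
    ("EI-B", "Specificity", "r2", "SPECIFIC"),
]


def facet_distributions(ratings):
    """Map each (EI, facet) pair to the share of raters choosing each value."""
    counts = defaultdict(Counter)
    for ei, facet, _rater, value in ratings:
        counts[(ei, facet)][value] += 1
    return {
        key: {value: n / sum(counter.values()) for value, n in counter.items()}
        for key, counter in counts.items()
    }


def chosen_by_at_least_half(distribution):
    """Values picked by at least half of the raters for one (EI, facet) pair."""
    return sorted(value for value, share in distribution.items() if share >= 0.5)


dist = facet_distributions(ratings)
for (ei, facet), shares in dist.items():
    print(ei, facet, shares, chosen_by_at_least_half(shares))
```

Keeping the shares per value, rather than only the majority value, preserves the disagreement between raters that the analysis relies on.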

6. Discussion and Conclusions

In section 4 we have seen disagreement between CAPABILITY and PERFORMANCE (Vc), between SPECIFIC and CONTAMINATED (Vs), and between UNSTRUCTURED and ABLATABLE (Ca). The distributions of these facets in section 5 may illustrate a difficulty in interpreting what the EI designers intended, i.e., a lack of clarity in the specification of the EI. It may also be a sign of unresolved issues in AI evaluation: going from task-oriented evaluation based on performance to more general EIs leads to SPECIFICITY problems. For instance, adding many millions of examples can help coverage but comes with problems of specificity and more difficulty in understanding the role each example plays in the overall score being measured by the EI.

Being aware of the consistency issues of the rating methodology, we think the set of facets and associated rubric, as well as the results of the study of 23 EIs reported in this paper, can be useful for three different kinds of users in slightly different ways. First, EI creators can see what design choices in their EI to modify from a first evaluation of its facets and see how it compares to other EIs. Second, AI system developers can choose the right EIs according to the facet values, and better understand what they can expect from the evaluation and what it means exactly. Finally, for policy-makers and stakeholders from academia, scientific publishing, industry, government and other strategic organisations, an increasing number of EIs being evaluated and catalogued can serve to understand the landscape of AI evaluation much better. This can help them recognise gaps and limitations, beyond the unstructured collections of benchmark results by metric that have become very useful for meta-analysis but still lack structure and insight about the EIs themselves.

In fact, there have been several studies focusing on numeric comparison and the evolution of performance for a range of EIs [35, 36]. These studies see the evolution of the progress of AI systems according to some metrics, but we need more analysis on how the evaluation instruments (benchmarks, competitions, standards, tests, etc.) are also evolving, and whether they are meeting the demands of a more comprehensive evaluation beyond some simple metrics. This was our main motivation.

We have faced some difficulties in determining the criteria for inclusion of EIs, the isolation of some facets that were difficult to understand or confused with others, and finding a protocol of application that is sufficiently robust but at the same time requires a limited number of raters and other resources. We plan this setting to be a live endeavour, with some facets being added, changed or removed in new versions of the rubric. However, some stability in names, facet values and facet description is needed to be able to compile the results of different rating studies over time, increasing from the 23 EIs evaluated here to the order of hundreds in the future, with a more diverse and numerous pool of raters. As an immediate continuation of this work ourselves, we plan to apply the rubric to further EIs. We hope these facets and the rubric describing them can help track the evolution of AI evaluation in the years to come, and identify the facets where changes are happening or should happen.

Acknowledgments

We thank the anonymous reviewers for their comments. The development of this rubric was performed in the context of the OECD AI and Future of Skills project. Several versions of the facets were discussed in a series of meetings within the project, and especially two meetings on July 5th 2021 and October 26th, where we presented preliminary versions of this rubric. In particular, we thank the OECD team (Stuart Elliott, Abel Baret, Margarita Kalamova, Nóra Révai, Mila Staneva) and the rest of experts and participants (Guillaume Avrin, Lucy Cheke, Kenneth D. Forbus, Yvette Graham, Patrick Kyllonen, Elena Messina, Britta Rüschoff, Michael Schönstein, Jim Spohrer and Swen Ribeiro). We also thank the OECD for the funding which made this work possible as well as their encouragement.

References

[1] A. Turing, Computing machinery and intelligence, Mind 59 (1950) 433.
[2] S. M. Shieber, Principles for designing an AI competition, or why the Turing Test fails as an inducement prize, AI Magazine 37 (2016) 91–96.
[3] S. M. Shieber, Lessons from a restricted Turing Test, Commun. ACM 37 (1994) 70–78.
[4] P. Hayes, K. Ford, Turing test considered harmful, in: International Joint Conference on Artificial Intelligence (IJCAI), 1995, pp. 972–977.
[5] A. G. Cohn, On evaluating artificial intelligence systems: Competitions and benchmarks, in: AI and the Future of Skills, Volume 1 Capabilities and Assessments, OECD, 2021, pp. 238–251.
[6] AERA, APA, NCME, et al., Standards for educational and psychological testing, American Educational Research Association, 2014.
[7] H. J. Levesque, The Winograd Schema Challenge, in: Logical Formalizations of Commonsense Reasoning, Papers from the 2011 AAAI Spring Symposium, Technical Report SS-11-06, Stanford, California, USA, March 21-23, 2011, AAAI, 2011. URL: http://www.aaai.org/ocs/index.php/SSS/SSS11/paper/view/2502.
[8] M. G. Bellemare, Y. Naddaf, J. Veness, M. Bowling, The arcade learning environment: An evaluation platform for general agents, J. Artif. Intell. Res. 47 (2013) 253–279. URL: https://doi.org/10.1613/jair.3912. doi:10.1613/jair.3912.
[9] A. Wang, A. Singh, J. Michael, F. Hill, O. Levy, S. R. Bowman, GLUE: A multi-task benchmark and analysis platform for natural language understanding, in: 7th International Conference on Learning Representations, ICLR 2019, New Orleans, LA, USA, May 6-9, 2019, OpenReview.net, 2019. URL: https://openreview.net/forum?id=rJ4km2R5t7.
[10] A. Wang, Y. Pruksachatkun, N. Nangia, A. Singh, J. Michael, F. Hill, O. Levy, S. R. Bowman, Superglue: A stickier benchmark for general-purpose language understanding systems, in: H. M. Wallach, H. Larochelle, A. Beygelzimer, F. d’Alché-Buc, E. B. Fox, R. Garnett (Eds.), Advances in Neural Information Processing Systems 32: Annual Conference on Neural Information Processing Systems 2019, NeurIPS 2019, December 8-14, 2019, Vancouver, BC, Canada, 2019, pp. 3261–3275. URL: https://proceedings.neurips.cc/paper/2019/hash/4496bf24afe7fab6f046bf4923da8de6-Abstract.html.
[11] J. Deng, W. Dong, R. Socher, L. Li, K. Li, L. Fei-Fei, Imagenet: A large-scale hierarchical image database, in: 2009 IEEE Computer Society Conference on Computer Vision and Pattern Recognition (CVPR 2009), 20-25 June 2009, Miami, Florida, USA, IEEE Computer Society, 2009, pp. 248–255. URL: https://doi.org/10.1109/CVPR.2009.5206848. doi:10.1109/CVPR.2009.5206848.
[12] J. Renz, X. Ge, M. Stephenson, P. Zhang, AI meets angry birds, Nat. Mach. Intell. 1 (2019) 328. URL: https://doi.org/10.1038/s42256-019-0072-x. doi:10.1038/s42256-019-0072-x.
[13] S. A. Gaggl, T. Linsbichler, M. Maratea, S. Woltran, Design and results of the second international competition on computational models of argumentation, Artif. Intell. 279 (2020). URL: https://doi.org/10.1016/j.artint.2019.103193. doi:10.1016/j.artint.2019.103193.
[14] The robocup standard platform league, https://spl.robocup.org/, 1998.
[15] The robocup@home league, https://athome.robocup.org/, 2006.
[16] V. Panayotov, G. Chen, D. Povey, S. Khudanpur, Librispeech: An ASR corpus based on public domain audio books, in: 2015 IEEE International Conference on Acoustics, Speech and Signal Processing, ICASSP 2015, South Brisbane, Queensland, Australia, April 19-24, 2015, IEEE, 2015, pp. 5206–5210. URL: https://doi.org/10.1109/ICASSP.2015.7178964. doi:10.1109/ICASSP.2015.7178964.
[17] D. Perez-Liebana, S. M. Lucas, R. D. Gaina, J. Togelius, A. Khalifa, J. Liu, General video game artificial intelligence, Synthesis Lectures on Games and Computational Intelligence 3 (2019) 1–191. https://gaigresearch.github.io/gvgaibook/.
[18] Y. Bisk, R. Zellers, R. LeBras, J. Gao, Y. Choi, PIQA: reasoning about physical commonsense in natural language, in: The Thirty-Fourth AAAI Conference on Artificial Intelligence, AAAI 2020, The Thirty-Second Innovative Applications of Artificial Intelligence Conference, IAAI 2020, The Tenth AAAI Symposium on Educational Advances in Artificial Intelligence, EAAI 2020, New York, NY, USA, February 7-12, 2020, AAAI Press, 2020, pp. 7432–7439. URL: https://ojs.aaai.org/index.php/AAAI/article/view/6239.
[19] N. Froleyks, M. Heule, M. Iser, M. Järvisalo, M. Suda, SAT competition 2020, Artif. Intell. 301 (2021) 103572. URL: https://doi.org/10.1016/j.artint.2021.103572. doi:10.1016/j.artint.2021.103572.
[20] R. Zellers, Y. Bisk, A. Farhadi, Y. Choi, From recognition to cognition: Visual commonsense reasoning, in: IEEE Conference on Computer Vision and Pattern Recognition, CVPR 2019, Long Beach, CA, USA, June 16-20, 2019, Computer Vision Foundation / IEEE, 2019, pp. 6720–6731. URL: http://openaccess.thecvf.com/content_CVPR_2019/html/Zellers_From_Recognition_to_Cognition_Visual_Commonsense_Reasoning_CVPR_2019_paper.html. doi:10.1109/CVPR.2019.00688.
[21] Assembly performance metrics and test methods, https://www.nist.gov/el/intelligent-systems-division-73500/robotic-grasping-and-manipulation-assembly/assembly, 2018.
[22] A. L. Maas, R. E. Daly, P. T. Pham, D. Huang, A. Y. Ng, C. Potts, Learning word vectors for sentiment analysis, in: D. Lin, Y. Matsumoto, R. Mihalcea (Eds.), The 49th Annual Meeting of the Association for Computational Linguistics: Human Language Technologies, Proceedings of the Conference, 19-24 June, 2011, Portland, Oregon, USA, The Association for Computer Linguistics, 2011, pp. 142–150. URL: https://aclanthology.org/P11-1015/.
[23] M. Sap, H. Rashkin, D. Chen, R. LeBras, Y. Choi, Socialiqa: Commonsense reasoning about social interactions, CoRR abs/1904.09728 (2019). URL: http://arxiv.org/abs/1904.09728. arXiv:1904.09728.
[24] M. R. Genesereth, N. Love, B. Pell, General game playing: Overview of the AAAI competition, AI Mag. 26 (2005) 62–72. URL: https://doi.org/10.1609/aimag.v26i2.1813. doi:10.1609/aimag.v26i2.1813.
[25] P. Rajpurkar, J. Zhang, K. Lopyrev, P. Liang, Squad: 100,000+ questions for machine comprehension of text, in: J. Su, X. Carreras, K. Duh (Eds.), Proceedings of the 2016 Conference on Empirical Methods in Natural Language Processing, EMNLP 2016, Austin, Texas, USA, November 1-4, 2016, The Association for Computational Linguistics, 2016, pp. 2383–2392. URL: https://doi.org/10.18653/v1/d16-1264. doi:10.18653/v1/d16-1264.
[26] Y. Yang, W. Yih, C. Meek, WikiQA: A challenge dataset for open-domain question answering, in: L. Màrquez, C. Callison-Burch, J. Su, D. Pighin, Y. Marton (Eds.), Proceedings of the 2015 Conference on Empirical Methods in Natural Language Processing, EMNLP 2015, Lisbon, Portugal, September 17-21, 2015, The Association for Computational Linguistics, 2015, pp. 2013–2018. URL: https://doi.org/10.18653/v1/d15-1237. doi:10.18653/v1/d15-1237.
[27] R. Zellers, Y. Bisk, R. Schwartz, Y. Choi, SWAG: A large-scale adversarial dataset for grounded commonsense inference, in: E. Riloff, D. Chiang, J. Hockenmaier, J. Tsujii (Eds.), Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing, Brussels, Belgium, October 31 - November 4, 2018, Association for Computational Linguistics, 2018, pp. 93–104. URL: https://doi.org/10.18653/v1/d18-1009. doi:10.18653/v1/d18-1009.
[28] A. Marot, B. Donnot, G. Dulac-Arnold, A. Kelly, A. O’Sullivan, J. Viebahn, M. Awad, I. Guyon, P. Panciatici, C. Romero, Learning to run a power network challenge: a retrospective analysis, in: H. J. Escalante, K. Hofmann (Eds.), NeurIPS 2020 Competition and Demonstration Track, 6-12 December 2020, Virtual Event / Vancouver, BC, Canada, volume 133 of Proceedings of Machine Learning Research, PMLR, 2020, pp. 112–132. URL: http://proceedings.mlr.press/v133/marot21a.html.
[29] L. Yang, Sdkd: Saliency detection with knowledge distillation, https://lifelong-robotic-vision.github.io/competition/papers/PekingU_linyang.pdf, 2019.
[30] C.-C. Hsu, B. A. Sandford, The Delphi technique: making sense of consensus, Practical assessment, research, and evaluation 12 (2007) 10.
[31] J. Devlin, M.-W. Chang, K. Lee, K. Toutanova, Bert: Pre-training of deep bidirectional transformers for language understanding, arXiv preprint arXiv:1810.04805 (2018).
[32] T. B. Brown, B. Mann, N. Ryder, M. Subbiah, J. Kaplan, P. Dhariwal, A. Neelakantan, P. Shyam, G. Sastry, A. Askell, et al., Language models are few-shot learners, in: Advances in Neural Information Processing Systems, volume 33, 2020, pp. 1877–1901.
[33] D. Hendrycks, S. Basart, S. Kadavath, M. Mazeika, A. Arora, E. Guo, C. Burns, S. Puranik, H. He, D. Song, J. Steinhardt, Measuring coding challenge competence with APPS, 2021. arXiv:2105.09938.
[34] R. Bommasani, et al., On the opportunities and risks of foundation models, arXiv preprint arXiv:2108.07258, 2021.
[35] F. Martinez-Plumed, P. Barredo, S. O. Heigeartaigh, J. Hernandez-Orallo, Research community dynamics behind popular AI benchmarks, Nature Machine Intelligence 3 (2021) 581–589.
[36] A. Barbosa-Silva, S. Ott, K. Blagec, J. Brauner, M. Samwald, Mapping global dynamics of benchmark creation and saturation in artificial intelligence, arXiv preprint arXiv:2203.04592 (2022).