A Framework for Categorising AI Evaluation Instruments

Anthony G Cohn (1), José Hernández-Orallo (2), Julius Sechang Mboli (3), Yael Moros-Daval (2), Zhiliang Xiang (4) and Lexin Zhou (2)
(1) School of Computing, University of Leeds, UK; and the Turing Institute, UK
(2) VRAIN, Universitat Politècnica de València, Spain
(3) Faculty of Engineering and Informatics, University of Bradford, UK
(4) IROHMS, School of Computer Science and Informatics, Cardiff University, UK

Abstract
The current and future capabilities of Artificial Intelligence (AI) are typically assessed with an ever increasing number of benchmarks, competitions, tests and evaluation standards, which are meant to work as AI evaluation instruments (EIs). These EIs are not only increasing in number, but also in complexity and diversity, making it hard to understand this evaluation landscape in a meaningful way. In this paper we present an approach for categorising EIs using a set of 18 facets, accompanied by a rubric that allows anyone to apply the framework to any existing or new EI. We apply the rubric to 23 EIs in different domains through a team of raters, and analyse how consistent the rubric is and how well it works to distinguish between EIs and map the evaluation landscape in AI.

Keywords
Evaluation Instruments, Comparison of Evaluation Instruments, Categorisation of Evaluation Instruments, Artificial Intelligence Evaluation, Future of Skills

1. Introduction

Ever since researchers started building AI systems, they have wanted to evaluate them, either against human benchmarks (such as playing human experts at Chess or other games) and/or against other AI systems. Finding good benchmarks for evaluating systems, and conducting tests, is harder than it might seem, particularly since we believe we have good methods for evaluating human intelligence, via standard tests and examinations.

There have been many tests proposed for evaluating AI systems. Probably the most famous of these is the Turing Test [1]. There have been various Turing Test competitions, of which the best known is the annual Loebner Prize competition; the results have been sometimes entertaining, and a way of promulgating ideas about AI to the general public, but it is hard to argue that any real important progress in AI has been demonstrated by the entrants. In fact, Turing himself never proposed the test as a serious way of measuring AI systems or of measuring progress, as Shieber [2] observes, adding that it is "misguided and inappropriate" ([3, 4]). Instead he argues for new "inducement prize" contests. According to Shieber, these are "award programs established to induce people to solve a problem of importance by directly rewarding the solver". Perhaps the most famous historical examples are the Longitude Rewards offered by the UK government in 1714. A current example is the $5M IBM Watson AI XPRIZE, which "challenges teams to demonstrate how humans can work with AI to tackle global challenges". Further discussion on the use of competitions, benchmarks and datasets in evaluating AI systems can be found in [5].

The situation today is that there are thousands of challenges in almost all areas of AI. They are increasing in complexity and diversity, as AI techniques likewise evolve. Because of this, it is hard to analyse this evaluation landscape in a meaningful way. Motivated by this need, we present and discuss an approach to categorising benchmarks, competitions, tests and evaluation standards, jointly referred to as AI evaluation instruments (EIs). We do this categorisation via a set of 18 facets, which we believe will be valuable in distinguishing and evaluating different proposals for evaluating AI systems. These facets, and an accompanying rubric to facilitate choosing appropriate values, are described in Section 2.

We will classify EIs using the facets in order to (a) evaluate how well the facets work in general and (b) see to what extent they help map the landscape of EIs and distinguish their differences. This may help inform how much we can translate from the facet values to guide the design of future EIs. We do not imagine there can be a single universal evaluation instrument, or even a battery for each domain (vision, reasoning, etc.); certainly that ideal has eluded the community so far. We do not even aspire to find facet values that are valid for all EIs, but our proposed work may help in directing future efforts in the evaluation of AI systems.

Since it is infeasible in a reasonable amount of time to apply this categorisation to the thousands of EIs in the literature, here we cover 23 EIs (see Table 2). By evaluating a reasonable number of carefully chosen examples, we hope to give a fair picture of the extent to which the aspects of AI appraised by the facets are being tested in the selected examples. Beyond the insights that we extract from this selected set of EIs, this paper and the rubric we have developed for the different facets should serve as a reference for third parties (e.g., other researchers) to analyse other EIs.

The rest of the paper is organised as follows. Section 2 presents the 18 facets and a rubric which explains how facet values should be chosen. Next, in Section 3, we discuss the criteria for selecting the 23 EIs and the methodology the raters used to apply the rubric. Section 4 discusses the level of disagreement between raters for each facet and EI, and how the methodology and the number of raters were adapted based on these observations. Section 5 analyses the ratings of the 23 EIs and what they reveal about this group of EIs. Finally, Section 6 closes with some general discussion and possible future work.

2. Characterising AI Evaluation Instruments

We looked for existing features or dimensions to characterise EIs, but unfortunately we did not find any systematic account in AI, other than concepts such as reproducibility, realism, coverage and specificity, usually referred to with other names and applied to a single EI. We found more dimensions and a more systematic coverage of evaluation instruments in the area of psychological testing. As a result, we have introduced a new set of facets; when possible, the terminology is based on common use in AI, but it also incorporates terms and concepts from the Standards for Educational and Psychological Testing by the American Educational Research Association [6].

The following list proposes 18 facets to characterise existing and future EIs for AI. Each facet has both a name and a two-letter acronym, whose initial letter is V, C or F, the reason for which will become clear later. Each facet is followed by its options in brackets. Some options indicate '(specify)', which means that the rater must indicate a (freetext) value for that option. The full descriptions of the facets usually include some examples and further clarifications; the latest version of the rubric can be found at https://tinyurl.com/mr2bv5hb. Here we only include the basic definition of each of them. We use colours (blue and black) that are indicative, with blue referring to the preferred or most challenging case, in general. However, for some facets a blue value may make no sense, or we do not believe that one value is 'better' than any other, so these facets have no coloured facet value(s).

• Vp - Purpose [RESEARCH, CONFORMITY, OTHER (specify)]: Is the benchmark meant to foster research or development, or to certify whether an AI system conforms with some level or standard?
• Vc - Capability [TASK-PERFORMANCE (specify), CAPABILITY (specify)]: Does the EI just measure observed (aggregated) performance on a TASK (e.g., protein folding, credit scoring) or is the EI designed to also measure a CAPABILITY (e.g., object permanence, dealing with negation)?
• Vf - Reference [ABSOLUTE, RELATIVE (specify)]: Are results reported as an absolute metric (criterion-referenced) or are they reported as a relative (percentage) metric to a reference (norm-referenced), e.g., human performance?
• Vo - Coverage [BIASED (specify), REPRESENTATIVE]: Does the EI cover a BIASED or unbiased (REPRESENTATIVE) distribution of what is meant to be measured?
• Vs - Specificity [SPECIFIC, CONTAMINATED]: Are the results precisely aligned with what is meant to be measured or contaminated by other skills or tasks?
• Vl - Realism [TOY, GAMIFIED, REALISTIC, REAL-LIFE]: To what extent is the EI a toy problem, a complex gamified problem, a realistic setting (but still in a simulated scenario, a lab or a testing facility), or is the evaluation itself happening in real life? (REAL-LIFE does not mean a final or specific product in operation. It can also happen in very early stages of research, such as evaluating prototype chatbots in a real social network.)
• Cj - Judgeability [MANUAL, AUTOMATED, MIXED]: Is scoring manual (e.g., through human questionnaires or judges), automated (e.g., correct answers or an optimality function), or a mixture?
• Cc - Containedness [FULLY-CONTAINED, PARTIAL-INTERFERENCE (specify), NOT-CONTAINED (specify)]: Once started, is the testing isolated from external factors or interference possibly having an effect on the results (human participants, online data, weather, etc.); is there some partial interference not affecting the results significantly; or is it dependent on external resources and conditions?
• Cp - Reproducibility [NON-REPRODUCIBLE, STOCHASTIC, EXACT]: Is the evaluation non-reproducible, with results biased or spoiled if repeated; does the EI have stochastic components leading to different interactions; or are the results completely reproducible, i.e., can exactly the same test (inputs, interaction, etc.) be generated again for another (or the same) competitor?
• Cl - Reliability [RELIABLE, NON-RELIABLE, N/A]: Does the evaluation present sufficient repetitions, episode length or number of instances to give low variance for the same subject when applied again (test-retest reliability)? If the testing methodology or the common use of the EI is not clear then N/A may be the most appropriate facet value.
• Cv - Variation [FIXED, ALTERED, PROCEDURAL]: Is the evaluation based on fixed datasets; have the instances been altered by adding post-processing variations (noise, rotations, etc.); or have they been created (e.g., using procedural generation)? (Although we have coloured PROCEDURAL, we recognise that procedural may not always be better and can lead to problems if variations are not in an appropriate proportion. Also, generated data may just lead to a learning algorithm reverse-engineering the generator.)
• Ca - Adjustability [UNSTRUCTURED, ABLATABLE, ADAPTIVE]: Is the analysis of results on the set of instances unstructured; has the EI identified a set of meta-features, such as difficulty or dimension, that could be used to analyse the results by these dimensions (ablatable); or are these meta-features used to adaptively or adversarially choose the instances to test more informatively (adaptive)?
• Fn - Antecedents [CREATED, RETROFITTED (specify)]: Is it devised on purpose for AI or adapted from tests designed to test humans?
• Fm - Ambition [SHORT, LONG]: When the EI was created, was it aiming at the short term (improving on the SOTA) or the long term (more ambitious goals)?
• Fp - Partiality [PARTIAL (specify), IMPARTIAL]: Does the EI favour particular technologies, conditions or cultures that should not have an influence on the result of the evaluation? (Vo-Coverage is about the domain, whilst Fp-Partiality is about how the EI may favour some test-takers over others.)
• Fo - Objectivity [LOOSE, CUSTOMISED, FULLY-INDEPENDENT]: Is it loosely defined, customised to each participant, or does the EI have a predetermined independent specification? (LOOSE refers to cases when the evaluation is very open, e.g., a robotic-domain EI where we evaluate a satisfactory interaction with the user, but not even a clear questionnaire is defined. FULLY-INDEPENDENT could treat different groups differently if there is a reason, for equality of treatment.)
• Fr - Progression [STATIC, DEVELOPMENTAL]: Is the score measuring a capability at one particular moment or is it evaluating the development of the capability of the system within the test?
• Fu - Autonomy [AUTONOMOUS, COUPLED (specify), COMPONENT]: Is it measuring an autonomous system, a system coupled with other systems (e.g., humans), or a component?

The facets above can be grouped into three main categories following the three main groups given by the Standards for Educational and Psychological Testing [6]: validity, reliability/precision and fairness. We use these three major groups to give some structure to the facets above. Roughly, these groups deal with what is measured, how it is measured and who is measured, respectively.

• Validity group (Does it measure what we want to measure?): Vp, Vc, Vf, Vo, Vs, Vl
• Consistency (Reliability/Precision) group (Does it measure it effectively and verifiably?): Cj, Cc, Cp, Cl, Cv, Ca
• Fairness group (Does it treat all test takers equally?): Fn, Fm, Fp, Fo, Fr, Fu

Some of these are closely related, such as {Cv, Ca, Vo} or {Fo, Cp}. The term accommodation in [6] is "used to denote changes with which the comparability of scores is retained, and the term modification is used to denote changes that affect the construct measured by the test". This is related to Vs, Cv, Fo and Cc, and also to the term "measurement invariance", which is very important here to see if accommodations of the same test could evaluate the same construct for different AI systems and even humans.
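To make the structure of the rubric concrete, the sketch below encodes the 18 facets and their options as a small data structure and checks a hypothetical rating against it. This is our illustrative rendering, not part of the official rubric: the facet acronyms and options come from the list above, while the validation logic, helper names and the example rating are assumptions.

```python
# Illustrative sketch (not the official rubric): the 18 facets and their options
# encoded as data, plus a minimal well-formedness check for a single rating.

FACETS = {
    # Validity group: what is measured
    "Vp": ["RESEARCH", "CONFORMITY", "OTHER"],
    "Vc": ["TASK-PERFORMANCE", "CAPABILITY"],
    "Vf": ["ABSOLUTE", "RELATIVE"],
    "Vo": ["BIASED", "REPRESENTATIVE"],
    "Vs": ["SPECIFIC", "CONTAMINATED"],
    "Vl": ["TOY", "GAMIFIED", "REALISTIC", "REAL-LIFE"],
    # Consistency (reliability/precision) group: how it is measured
    "Cj": ["MANUAL", "AUTOMATED", "MIXED"],
    "Cc": ["FULLY-CONTAINED", "PARTIAL-INTERFERENCE", "NOT-CONTAINED"],
    "Cp": ["NON-REPRODUCIBLE", "STOCHASTIC", "EXACT"],
    "Cl": ["RELIABLE", "NON-RELIABLE", "N/A"],
    "Cv": ["FIXED", "ALTERED", "PROCEDURAL"],
    "Ca": ["UNSTRUCTURED", "ABLATABLE", "ADAPTIVE"],
    # Fairness group: who is measured
    "Fn": ["CREATED", "RETROFITTED"],
    "Fm": ["SHORT", "LONG"],
    "Fp": ["PARTIAL", "IMPARTIAL"],
    "Fo": ["LOOSE", "CUSTOMISED", "FULLY-INDEPENDENT"],
    "Fr": ["STATIC", "DEVELOPMENTAL"],
    "Fu": ["AUTONOMOUS", "COUPLED", "COMPONENT"],
}

# Options marked '(specify)' in the rubric require a free-text note from the rater.
REQUIRES_SPECIFY = {
    ("Vp", "OTHER"), ("Vc", "TASK-PERFORMANCE"), ("Vc", "CAPABILITY"),
    ("Vf", "RELATIVE"), ("Vo", "BIASED"), ("Cc", "PARTIAL-INTERFERENCE"),
    ("Cc", "NOT-CONTAINED"), ("Fn", "RETROFITTED"), ("Fp", "PARTIAL"),
    ("Fu", "COUPLED"),
}

def check_rating(rating: dict) -> list:
    """Return the problems in a rating: missing facets, unknown options,
    and '(specify)' options left without a free-text note."""
    problems = []
    for facet, options in FACETS.items():
        if facet not in rating:
            problems.append(f"missing facet {facet}")
            continue
        value, note = rating[facet]
        if value not in options:
            problems.append(f"{facet}: unknown option {value!r}")
        elif (facet, value) in REQUIRES_SPECIFY and not note:
            problems.append(f"{facet}: option {value!r} requires a '(specify)' note")
    return problems

if __name__ == "__main__":
    # Hypothetical rating (values invented, defaulting to the first option).
    rating = {f: (FACETS[f][0], "example note") for f in FACETS}
    rating["Vp"] = ("RESEARCH", "")
    rating["Vl"] = ("REALISTIC", "")
    print(check_rating(rating))   # -> [] if the rating is well formed
```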
3. EI Selection and Rating Methodology

Now that the facets and the rubric have been explained, we proceed to discuss how the EIs were selected, what the final selection was, and what protocol we followed in assigning EIs to the raters.

3.1. EI Selection

We considered evaluation instruments with the following criteria for inclusion:

• Potential interest to understand the future of AI skills: An EI might be regarded as being of interest if systems which perform well on it can be regarded as indicating a noteworthy change in the capabilities of AI in general. In other words, progress on this EI requires significant enhancement of AI techniques beyond the specific requirements of the EI.
• Diversity in the kind of task: We tried to cover a variety of domains, formats and types of problems (vision, natural language, competitions, datasets, supervised, etc.).
• Popularity: How many teams have already used this EI? How many published papers refer to it? We can use proxies for this, such as citations to the original papers introducing the EI, or the number of results on websites such as paperswithcode.com. We also have to consider that industry-related EIs may be less popular than research-oriented EIs. However, given the number of EIs selected, we repeat domains and cover just a few areas (e.g., NLP, vision, robotics) without being comprehensive for all possible domains.
• Currency: We prefer EIs still in active use or recently introduced, rather than those which have fallen out of use.

The sources of the EIs were mostly repositories (e.g., http://paperswithcode.com, http://kaggle.com, https://zenodo.org/record/4647824#.YV7CPdrMKUk, https://www.eff.org/ai/metrics, https://en.wikipedia.org/wiki/List_of_datasets_for_machine-learning_research, http://www.chalearn.org) and surveys, institutions such as NIST (https://www.nist.gov/programs-projects/ai-measurement-and-evaluation) and LNE (https://www.lne.fr/en/testing/evaluation-artificial-intelligence-systems), and competitions at AI conferences. Then, we identified possible gaps in terms of domains, or whether we expected that the answers for some facets were going to be too similar. We also considered whether we would expect to get diversity in the blue values for the facets, so that we get different levels of quality according to this colour code. Note that at the time of selection we could of course only roughly estimate how many blue categories we might get for each EI. Since we expected to learn more about the categorising of EIs as categorisation proceeded, we did not choose all EIs in advance but selected them incrementally. The 23 selected EIs are shown in Table 2.

These EIs cover a good distribution of benchmarks, competitions and datasets, although some of them can be considered to be in two of these categories. The term 'test' to refer to an EI is less usual. About half of the 23 EIs require the use of language in the inputs and/or outputs, and about one half of them require some kind of perception (mostly computer vision), with some overlap between these two groups. Only a few of the EIs are related to navigation and robotics, in virtual (e.g., video games) or physical environments, and a small number are related to more abstract capabilities or to problems related to planning or optimisation.

3.2. Rating Methodology

We devised a protocol to refine and validate the rubric, but also to cover as many EIs as possible, according to the number of raters we had available. We explain the protocol below, but we note that this protocol can be adapted to other situations or can incorporate ideas from consensus-based ratings or the Delphi method [30]. First, two of the authors of this paper (A.C. and J.H-O.) acted as coordinators for the rating process. A total of four raters were chosen. Raters were AI-related undergraduate and graduate students, and were recruited through a selection process and interviews. They are the other four authors of this paper (J-S.M., Y.M-D., Z.X. and L.Z.). Once the raters were appointed, each rater was given some meta-information about each EI (acronym, name, major sources, what it measures, etc.) and had to complete some other general information about each EI. They were also asked for some information about their own completion, such as the time taken (in hours).

We established three batches, covering 2, 11 and 10 EIs respectively, in the order they are presented in Table 2. The first two EIs had already been used by the coordinators in developing the list of facets and their values. All the raters started off on these two EIs too and were given feedback on their chosen values before proceeding to any further EIs. We refer to these two EIs as "Batch 1". The next 11 EIs are referred to as "Batch 2". These two batches were rated by all four raters, independently. After the analysis of consistency we deemed it sufficient to have only two raters per EI. Then, a final set of 10 EIs, referred to as "Batch 3", were each rated by just two raters, for reasons of economy, since we already had reasonable inter-rater consistency after the end of Batches 1 and 2. The two raters for each EI were assigned so that all raters would have five EIs and, across their five EIs, they co-rated with all the other three raters (i.e., one EI with one other rater and two EIs with each of the other raters); a small sketch of such an assignment is given below. In this first stage, they worked independently, not sharing values for any of the facets, and only reporting questions and partial results to the coordinators.

There were some changes to the rubric between batches, especially clarifying the description of some of the facets and, in a few cases, changing the number and/or name of the options. Whenever a change was introduced, the raters were informed and had to revisit their ratings for previous batches.

In a second and final stage of the process, the coordinators allowed the raters to exchange opinions, but they were not asked to reach a consensus, just to identify possible misunderstandings. From this discussion, a few ratings were modified. Unless explicitly stated, we refer to these final ratings in the rest of the paper.
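The pairing constraint for Batch 3 (each rater covers five of the ten EIs, co-rating one EI with one colleague and two EIs with each of the other two) admits a simple explicit assignment. The sketch below, with hypothetical rater labels, constructs one such assignment and verifies the constraint; it only illustrates the scheme and is not the actual allocation used in the study.

```python
from itertools import combinations
from collections import Counter

# Hypothetical labels for the four raters; the ten Batch-3 EIs are implicit (one per pair entry).
raters = ["R1", "R2", "R3", "R4"]

# One assignment satisfying the scheme in the text: two rater pairs co-rate a
# single EI, the remaining four pairs co-rate two EIs each (2*1 + 4*2 = 10 EIs).
pair_for_ei = [
    ("R1", "R2"),                      # 1 EI
    ("R3", "R4"),                      # 1 EI
    ("R1", "R3"), ("R1", "R3"),        # 2 EIs
    ("R1", "R4"), ("R1", "R4"),        # 2 EIs
    ("R2", "R3"), ("R2", "R3"),        # 2 EIs
    ("R2", "R4"), ("R2", "R4"),        # 2 EIs
]

# Check: every rater rates exactly five EIs and co-rates with all three colleagues.
load = Counter(r for pair in pair_for_ei for r in pair)
assert all(load[r] == 5 for r in raters)
assert {frozenset(p) for p in pair_for_ei} == {frozenset(p) for p in combinations(raters, 2)}
print(dict(load))   # {'R1': 5, 'R2': 5, 'R3': 5, 'R4': 5}
```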
Table 2
EIs given to raters and included in our analysis.

Acronym | Type | Domain | Aim | Year
WSC [7] | test, benchmark & competition | LU, CS, reasoning | Specifically targeted to evaluate common sense reasoning, as an alternative to the Turing Test, arguing conceptual and practical advantages. | 2016
ALE [8] | benchmark | VG; navigation; perception | The original goal was to evaluate "general, domain-independent AI technology" by using a diversity of video games, although what it measures more specifically is unclear. | 2013
GLUE [9] | benchmark | LU; text retrieval; world knowledge | The goal of GLUE and SuperGLUE (an improved/modified version of GLUE) is to measure the performance (e.g., accuracy, F1-score) of an AI system in natural language understanding tasks (single-sentence tasks, similarity and paraphrase tasks, and inference tasks) in English. | 2018
SUPERGLUE [10] | benchmark | LU; text retrieval; world knowledge | The goal of GLUE and SuperGLUE (an improved/modified version of GLUE) is to measure the performance (e.g., accuracy, F1-score) of an AI system in natural language understanding tasks (single-sentence tasks, similarity and paraphrase tasks, and inference tasks) in English. | 2019
IMAGENET [11] | competition | image classification; object recognition; object localisation | Aims to measure the visual recognition capability for object recognition, image classification and object localisation. The images can contain different numbers of objects (e.g., mammal, bird, fish, vehicle, furniture, tool, flower, fruit, etc.), occlusions and clutter (i.e., diversity and noise). | 2010
AIBIRDS [12] | competition | CV, VG, KRRP | Measures the planning capability of an agent in a large action space, without knowledge of the physical parameters of objects, in the situation given by Angry Birds. | 2010
ICCMA [13] | competition | reasoning; AA, CL | Aims to measure/compare the performance of different solvers regarding argumentation (particularly, reasoning problems that require logic). | 2015
RoboCup SPL [14] | competition | RCRPVMASS | Aims to measure and promote improvements in multi-robot (humanoid) systems by playing soccer matches with robots. | 1998
RoboCup@Home [15] | competition | HRIC, NMDE, CV, ABP | Aims to measure the performance of the developed AI robots in providing service with assistive robot technology with high relevance for future personal domestic applications. | 2006
Librispeech-SL12 [16] | dataset | speech recognition | Aims to provide a freely available read speech corpus in English that is suitable for training and testing speech recognition systems. | 2015
GVGAI [17] | competition | VG; general AI; PN | Aimed at systems that can perform well in multiple video games, possibly without knowing the game in advance and with little to no specific domain knowledge, as an approximation to artificial general intelligence. | 2014
PIQA [18] | benchmark, dataset | PCU, NLP, reasoning | Aims to measure physical interaction reasoning about both the prototypical use of objects (e.g., shoes are used for walking) and non-prototypical but practically plausible use of objects (e.g., shoes can be used as a doorstop). It targets language representations of knowledge traditionally only seen or experienced. | 2019
SAT [19] | competition | boolean satisfiability | Aims to keep progress and further improve the performance and robustness of SAT solvers, with a history dating back to the early 90s, thanks to the persistent efforts of the SAT community. | 2002
VCR [20] | dataset | CR; cognition; VR | Aims to measure the ability to infer what is happening in a picture (people's actions, goals, etc.) from visual signs which are obvious for humans. | 2019
Assembly [21] | competition | RM, ARH, MPLT, DiHM, RGVELO, anthropomorphic hand | Identifying key competencies and characteristics of robotic systems using a robust set of formalised evaluations and benchmarks, to help match robotic hand capabilities to end-user needs as well as to provide developers and researchers insight for improving their hardware and software designs. | 2017
IMDb [22] | dataset | NLP | Detecting the sentiment of a piece of text. | 2011
SocialIQA [23] | benchmark | SI, SIn, EI, IR | Aims to measure the social and emotional intelligence of computational models through multiple-choice question answering. | 2019
GGP [24] | competition | game playing | General game playing (GGP) is the design of artificial intelligence programs able to play more than one game successfully. | 2005
SQUAD2.0 [25] | dataset | reading comprehension; NLP | Aims to measure reading comprehension abilities that allow a system to get a correct answer to a given question when the solution can be extracted from the text, or to abstain from answering otherwise. | 2018
WikiQA [26] | benchmark, dataset | NLP | WikiQA is a dataset for open-domain question answering. | 2014
SWAG [27] | dataset, benchmark | NLI, CR | Aims to evaluate the performance of a system in grounded commonsense inference (reasoning about a situation and anticipating what might come next) by answering multiple-choice questions. | 2018
L2RPN [28] | competition | SG, AI, PG, PN | This challenge aims at testing the potential of AI to address this important real-world problem (running a power network) for our future. | 2012
Lifelong-Robots [29] | competition | robotics, CV, RV | Provides a robotic vision dataset collected from real-time environments to accelerate both research and applications of visual models for robotics. | 2019
Abbreviations: HRIC = Human-Robot Interaction and Cooperation; NMDE = Navigation and Mapping in Dynamic Environments; CV = Computer Vision; ABP = Adaptive Behaviours, Planning; AA = Abstract Argumentation; CL = Computational Logic; VG = Video Games; KRRP = Knowledge Representation, Reasoning, Planning; RCRPVMASS = Robotics, Cooperation, Real-time Planning, Vision, Multiagent Systems, Strategy; LU = Language Understanding; CS = Common Sense; RM = Robotics in Manufacturing; ARH = Adaptive Robot Hands; MPLT = Manipulation Planning based on Learning Techniques; DiHM = Dexterous in-Hand Manipulation; RGVELO = Robust Grasping with Various Everyday Life Objects; SI = Social Interaction; SIn = Social Intelligence; EI = Emotional Intelligence; IR = Inferential Reasoning; CR = Commonsense Reasoning; VR = Visual Recognition; PN = Planning and Navigation (GVGAI) / Power Networks (L2RPN); SG = Smart Grids; PG = Power Grids; PCU = Physical Commonsense Understanding; NLI = Natural Language Inference; RV = Robotic Vision.

4. Analysis of Rater Consistency

As noted above, the 1st and 2nd batches differ from Batch 3 because the former had four raters whilst the latter only two. Thus, in the former case, a majority agreement can be formed with three or four raters agreeing, whilst in Batch 3 only when both raters agree; hence 'majority' is less statistically significant for the 3rd batch. For simplicity, we will use round A and round B respectively when referring to the first two batches and the 3rd batch. As shown in Figure 1, the level of agreement coincides to a great extent when comparing the results from all batches (Figure 1, top) with the individual ones from round A (Figure 1, middle) and round B (Figure 1, bottom).

It can be expected that those facets with more possible values (4) might have more disagreements than those with only two possible values, simply for statistical reasons. We can see that in fact this is not having a big effect, as shown in Table 1.

Table 1
Level of agreement for the 18 facets, according to the number of options for each facet.

Level               | 2 options      | 3 options          | 4 options | Total
Consistently Agreed | Fr, Fn         | Vp, Cj, Cc, Fo, Fu | -         | 7
Moderately Agreed   | Vf, Fp         | Cp, Cv             | Vl        | 5
Often Diverged      | Vc, Vo, Vs, Fm | Cl, Ca             | -         | 6

[Figure 1: Agreements on facet value ratings for the 23 EIs and rounds A and B. Three stacked bar charts ("Facets Agreements for Both Rounds", "Facets Agreements for Round A", "Facets Agreements for Round B") show, for each facet, the counts of EIs with agreement, majority agreement and disagreement.]
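The agreement levels summarised in Figure 1 and Table 1 can be reproduced mechanically from the raw ratings. The sketch below illustrates the counting rule stated above: a facet value for an EI counts as agreed when all its raters coincide, and as a majority when more than half (but not all) do, which in round B (two raters) leaves only agreement or disagreement. The ratings and helper names are invented placeholders, not the study's data.

```python
from collections import Counter

def agreement_level(values):
    """Classify one facet of one EI from its raters' chosen options:
    'agreement' if all raters coincide, 'majority' if more than half
    (but not all) coincide, otherwise 'disagreement'."""
    top = Counter(values).most_common(1)[0][1]
    if top == len(values):
        return "agreement"
    if top > len(values) / 2:
        return "majority"
    return "disagreement"

# Invented example: Vp ratings for three EIs, two from round A (four raters)
# and one from round B (two raters).
ratings = {
    ("EI-1", "Vp"): ["RESEARCH", "RESEARCH", "RESEARCH", "RESEARCH"],   # round A
    ("EI-2", "Vp"): ["RESEARCH", "RESEARCH", "RESEARCH", "CONFORMITY"], # round A
    ("EI-3", "Vp"): ["RESEARCH", "CONFORMITY"],                         # round B
}

per_facet = Counter((facet, agreement_level(vals))
                    for (_, facet), vals in ratings.items())
print(per_facet)
# Counter({('Vp', 'agreement'): 1, ('Vp', 'majority'): 1, ('Vp', 'disagreement'): 1})
```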
The pattern of agreement or disagreement amongst the raters tends to vary depending on several factors, such as facet complexity, the available information on the EI, and so on. In particular, we observe the following:

• Fr, Fn, Vp, Cj, Cc, Fo, Fu are consistently agreed across all batches, with very few disagreements.
• Vf, Vl, Cp, Cv, Fp appear to be moderately agreed and supported by a majority (≥ 75%). Notably, Vl has the largest number of value options, but is still agreed well by a majority.
• Vo and Vs, despite having only binary options, are, together with Cl, among the least agreed.

It is not surprising that some of the facets consistently reached consensus, considering that their values tend to distribute towards one single selection (detailed in Section 5). For instance, as we will see in the following section, RESEARCH is picked for the Vp facet with only one disagreement for all rounds. This might reflect the fact that some facets do not have much variability in their chosen options. For example, most EIs are indeed proposed for the purpose of research (Vp), and given the low variability in the values there cannot be much disagreement (the variance of a Bernoulli distribution). As the variability of facets increases, choosing answers for the facets might require more EI-specific domain knowledge from the raters. For instance, to make justifiable decisions for facets like Vo and Vs, raters often needed to seek related literature for support when the answers were not clear from the specifications of the EIs. Whether an EI is specific (Vs) and general (Vo) enough for the measuring of certain capabilities is indeed hard to judge depending solely on the specifications. As such, information extracted from different sources might lead to disagreements in the selections.

Moreover, the subjectivity of a facet could also contribute to value divergences. This might be a reasonable explanation for the inconsistent selections in Vc, Ca and Fm, since they allow raters more space for subjective interpretations. While relevant information w.r.t. Vc and Fm is often stated in the EI specifications, these statements can be interpreted to different degrees or in different ways. For example, an EI for natural language understanding (NLU) could aim at improving state-of-the-art performance (short-term) or at measuring agents' capabilities regarding NLU (long-term); object recognition could be argued to be a visual capability or a specific task. Having both option variability and subjectivity made these three facets the least agreed ones. Also, some facets are related, and a disagreement in one may be accompanied by disagreement in others. For instance, when TASK-PERFORMANCE is selected for Vc, the value of the Vs facet is more likely to be SPECIFIC. As such, Vs is more likely to diverge if disagreement occurred on Vc. This might also account for the high divergence rate of facets in the Validity group.

In summary, apart from the statistical reason given by the number of values and their variability, the causes for disagreement can be grouped into three blocks:

• Similarity between facet values: The closeness or similarity between facet options might also have reduced the chance of picking the right option. For example, the facet Vl - Realism has four options (TOY, GAMIFIED, REALISTIC and REAL-LIFE), and it is not always easy to distinguish between REALISTIC and REAL-LIFE.
• Insufficient details: For many EIs, the information or details provided by the organisers of the competition, the test or the datasets in the EI are not sufficient to understand what the EI is actually measuring. Other EIs are well documented and have published articles that make it easy to obtain meta-information and the facet values for such EIs.
• Conflicting information: One of the factors that did not help is the source of information about each EI. For some EIs, there is perhaps too much information and many papers using them, and they do not always understand the same thing or use it in the same way. One paper or website might be talking about task performance while other sources talk of capabilities or both.

Overall, given these sources and levels of disagreement, as shown in Figure 1, we considered the rubric sufficiently validated to move from round A to round B with fewer raters, and for the analysis in the next section.

5. Analysis of Results

Herein, we break down the results obtained by the raters to describe what they reveal about the 23 selected EIs (Table 2). Figure 2 shows the frequencies of the different options of the 23 EIs for each of the 18 facets. The frequency is calculated differently in the first and the second round. In round A, since we have four raters, each counts for 0.25 units of frequency (if all chose the same option, it sums up to 1). In round B, we have two raters, each counting for 0.5. In total, we have a maximum frequency of 23 in each option.
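A minimal sketch of this weighting scheme follows: each of an EI's raters contributes 1/(number of raters for that EI) to the frequency of the option they chose, so every EI contributes exactly one unit per facet and the per-facet total is bounded by 23. The ratings below are invented placeholders, not the study's data.

```python
from collections import defaultdict

# Invented ratings for one facet (Vp): EI name -> options chosen by its raters.
# Round A EIs have four raters, round B EIs have two.
vp_ratings = {
    "EI-A (round A)": ["RESEARCH", "RESEARCH", "RESEARCH", "CONFORMITY"],
    "EI-B (round B)": ["RESEARCH", "RESEARCH"],
}

def option_frequencies(ratings_per_ei):
    """Each rater of an EI contributes 1/n_raters, so each EI adds one unit in total."""
    freq = defaultdict(float)
    for chosen in ratings_per_ei.values():
        weight = 1.0 / len(chosen)          # 0.25 in round A, 0.5 in round B
        for option in chosen:
            freq[option] += weight
    return dict(freq)

print(option_frequencies(vp_ratings))
# {'RESEARCH': 1.75, 'CONFORMITY': 0.25}  -> sums to 2.0, one unit per EI
```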
[Figure 2: The distribution of the options in all facets.]

Validity group (Does it measure what we want to measure?): Nearly all EIs are designed to foster RESEARCH (Vp) and use ABSOLUTE metrics (a preferred option in Vf). The number of EIs dedicated to measuring performance on a concrete task and of EIs aiming to measure a capability is similar (Vc), which suggests that the field (at least as represented by these 23 EIs) is undecided on whether to evaluate performance or capabilities. In Vo, most EIs were classified as REPRESENTATIVE. However, the percentage of BIASED EIs is still significant (circa 25%), suggesting more effort may be needed to improve the coverage of current (as well as future) EIs to mitigate or avoid unrepresentative and unreliable assessments. Surprisingly, only around half of the EIs were SPECIFIC (Vs), i.e., the other half were CONTAMINATED. All the EIs that were designed for TASK-PERFORMANCE are always SPECIFIC (this is suggested in the rubric) but, more interestingly, most EIs designed to measure a CAPABILITY are CONTAMINATED (i.e., the results do not completely align with what is meant to be measured). More effort is needed to encourage reliable and robust methodologies to evaluate the capabilities of AI systems, although we recognise that it is sometimes inevitably hard to measure certain capabilities reliably (e.g., common-sense reasoning). With regard to realism (Vl), REALISTIC EIs account for a predominant proportion (circa 80%), implying considerable focus on measuring systems solving practical problems, but with the evaluation not taking place in an actual real-life scenario; thus most EIs focus on evaluating systems in simulated scenarios or scenarios which are an abstraction of a real-world setting.

Consistency group (Does it measure it effectively and verifiably?): Nearly all EIs are FULLY-CONTAINED (Cc), implying that current EIs enjoy high independence from external factors during the assessment, and RELIABLE (Cl), which are desirable features. Regarding Cj, most EIs evaluate the systems with AUTOMATED scoring instead of MANUAL or MIXED. This can be double-edged, since automated scoring is generally more objective and faster to calculate but also requires a proper definition of the scoring. (Easy scoring gives an impression of higher objectivity but some subjectivity still exists in the choice of the metric itself. Automated scoring usually helps with repeatability and traceability.) For instance, how do we use automated scoring to evaluate whether a robotic dancer or cook is good or bad? This may be easy for some human experts but quite hard to define using a metric. Things become particularly complicated when measuring a special capability, such as common-sense reasoning. In terms of Cv, nearly all EIs are FIXED datasets. Almost none had altered the instances by adding post-processing variations or created new ones to cover a range of variations intrinsically, possibly because using fixed datasets is easier than modifying instances systematically. However, this could obstruct diversity in the evaluation methodology (e.g., sometimes it would be interesting to see how the system's performance varies by adding noise to the data to test the model's robustness). Surprisingly, most EIs are UNSTRUCTURED or ABLATABLE (Ca), but almost none are ADAPTIVE. This might be because adaptive tests are much more difficult to operate and require an understanding of what the most informative instances are.

Fairness group (Does it treat all test takers equally?): EIs that are IMPARTIAL account for 80% of the data (Fp), which seems a good indicator. However, the actual value might be even lower, since it is often hard to detect partiality. For instance, in an EI for benchmarking clinical decision support systems, the training set may only include Latin American patients while there are patients from other regions in the test set. Interestingly, virtually all the analysed EIs are classified as FULLY-INDEPENDENT (Fo), as the values CUSTOMISED and LOOSE only reach a frequency of 0.25 (i.e., these options were only chosen once). The fact that current EIs have the same predetermined specification for all assessed systems is positive and a characteristic that favours fairness in evaluation. Nearly all EIs evaluate the AI systems statically rather than developmentally (Fr), possibly because for many applications we care more about the final performance than about how the system's performance evolves. Also, it is easier to evaluate the former than the latter. However, DEVELOPMENTAL EIs could give more insights about how the models learn with variations of the input features and different curricula, detect when and why things go wrong during the training phase, and explore the trade-off between number of instances, time and performance.

In summary, regarding the validity of the EIs, we found that most of the selected EIs that measure a capability do not necessarily measure the capability reliably. Still, these failures could serve as excellent future references for developing more robust frameworks for evaluating capabilities, and more efforts are required in the years to come. Also, we still need to improve the coverage (i.e., representativeness) of the current EIs. In addition, the development of more EIs with real-life settings may encourage the development of AI systems better able to operate in real-life situations. Regarding the consistency group: albeit most of the selected EIs measure effectively and verifiably, as they are FULLY-CONTAINED and RELIABLE, there is still an evident lack of diversity in the evaluation process. For instance, we may need more EIs focusing on altering instances by adding post-processing variations or creating instances to cover a range of variations intrinsically. Also, more adaptive ways to test a system should be encouraged, in order to evaluate how the system copes in circumstances of different difficulty. Finally, in terms of fairness, the selected EIs enjoy low partiality and high objectivity. However, more effort is needed in spurring EIs to also focus on evaluating how a system performs during the development process. Furthermore, the community may need more benchmarks that focus on humans and machines working together, since only one out of the 23 EIs was done this way.

When looking at the distribution of facet values per EI, we can see that those related to robotics and the physical world (RoboCup SPL, RoboCup@Home and Lifelong-Robots) have more variability in judgeability (MANUAL becomes more frequent), realism (REALISTIC and REAL-LIFE also become more frequent) and containedness (PARTIAL-INTERFERENCE becoming more common), as well as in autonomy, with the COUPLED value being chosen for some of them. One of the most popular EIs in the history of AI, ImageNet, is the only one where the value PARTIAL is chosen by (at least) half of the raters, and also the one with all BIASED values chosen in coverage (along with LibriSpeech). The disagreement in partiality may suggest that some sources of partiality are only discovered after repeated use of an EI and are not identified by everyone immediately. GVGAI is peculiar as a well-thought-out EI, where video games are ablatable by several characteristics or by the difficulty of the game. It is also going in the direction of being procedural, but still to a limited extent as per the values assigned by the raters for this EI. Finally, those EIs related to natural language, and especially WSC, GLUE, SUPERGLUE, PIQA, SocialIQA, SQUAD2.0, WikiQA and SWAG, have high degrees of CONTAMINATED values in the facet Specificity. This might be a reflection of how difficult it is to isolate particular capabilities when using natural language, as some basic natural language competency requires many other things. This is also reflected by the recent success of language models in doing a variety of tasks [31, 32, 33, 34], since mastering natural language seems to be contaminated by so many other capabilities and skills.

6. Discussion and Conclusions

In Section 4 we have seen disagreement between CAPABILITY and TASK-PERFORMANCE (Vc), between SPECIFIC and CONTAMINATED (Vs), and between UNSTRUCTURED and ABLATABLE (Ca). The distributions of these facets in Section 5 may illustrate a difficulty in interpreting what the EI designers intended, i.e., a lack of clarity in the specification of the EI. It may also be a sign of unresolved issues in AI evaluation: going from task-oriented evaluation based on performance to more general EIs leads to SPECIFICITY problems. For instance, adding many millions of examples can help coverage but comes with problems of specificity and more difficulty in understanding the role each example plays in the overall score being measured by the EI.
Several lems of specificity and more difficulty in understanding versions of the facets were discussed in a series of meet- the role each example plays in the overall score being ings within the project, and especially two meeting in measured by the EI. July 5th 2021 and October 26th, where we presented pre- Being aware of the consistency issues of the rating liminary versions of this rubric. In particular, we thank methodology, we think the set of facets and associated the OECD team (Stuart Elliott, Abel Baret, Margarita rubric, as well as the results of the study of 23 EIs reported Kalamova, Nóra Révai, Mila Staneva) and the rest of ex- in this paper, can be useful for three different kinds of perts and participants (Guillaume Avrin, Lucy Cheke, users in slightly different ways. First, EI creators can see Kenneth D. Forbus, Yvette Graham, Patrick Kyllonen, what design choices in their EI to modify from a first eval- Elena Messina, Britta Rüschoff, Michael Schönstein, Jim uation of its facets and see how it compares to other EIs. Spohrer and Swen Ribeiro). We also thank the OECD for For AI system developers, they can choose the right EIs the funding which made this work possible as well as according to the facet values, and better understand what their encouragement. they can expect from the evaluation and what it means exactly. Finally, for policy-makers and stakeholders from academia, scientific publishing, industry, government References and other strategic organisations, an increasing number of EIs being evaluated and catalogued can serve to under- [1] A. Turing, Computing machinery and intelligence, stand the landscape of AI evaluation much better. This Mind 59 (1950) 433. can help them recognise gaps and limitations, beyond the [2] S. M. Shieber, Principles for designing an AI compe- unstructured collections of benchmark results by metric tition, or why the Turing Test fails as an inducement that have become very useful for meta-analysis but still prize, AI Magazine 37 (2016) 91–96. lacking structure and insight about the EIs themselves. [3] S. M. Shieber, Lessons from a restricted Turing Test, In fact, there have been several studies focusing on Commun. ACM 37 (1994) 70–78. numeric comparison and the evolution of performance [4] P. Hayes, K. Ford, Turing test considered harm- for a range of EIs [35, 36]. These studies see the evolution ful, in: International Joint Conference on Artificial of the progress of AI systems according to some metrics, Intelligence (IJCAI), 1995, pp. 972–977. but we need more analysis on how the evaluation in- [5] A. G. Cohn, On evaluating artificial intelligence struments (benchmarks, competitions, standards, tests, systems: Competitions and benchmarks, in: AI etc.) are also evolving, and whether they are meeting the and the Future of Skills, Volume 1 Capabilities and demands of a more comprehensive evaluation beyond Assessments, OECD, 2021, pp. 238–251. some simple metrics. This was our main motivation. [6] AERA, APA, NCME, et al., Standards for educa- We have faced some difficulties in determining the tional and psychological testing, American Educa- criteria for inclusion of EIs, the isolation of some facets tional Research Association, 2014. that were difficult to understand or confused with others, [7] H. J. 
[7] H. J. Levesque, The Winograd Schema Challenge, in: Logical Formalizations of Commonsense Reasoning, Papers from the 2011 AAAI Spring Symposium, Technical Report SS-11-06, Stanford, California, USA, AAAI, 2011. URL: http://www.aaai.org/ocs/index.php/SSS/SSS11/paper/view/2502.
[8] M. G. Bellemare, Y. Naddaf, J. Veness, M. Bowling, The arcade learning environment: An evaluation platform for general agents, J. Artif. Intell. Res. 47 (2013) 253–279. doi:10.1613/jair.3912.
[9] A. Wang, A. Singh, J. Michael, F. Hill, O. Levy, S. R. Bowman, GLUE: A multi-task benchmark and analysis platform for natural language understanding, in: 7th International Conference on Learning Representations (ICLR 2019), New Orleans, LA, USA, 2019. URL: https://openreview.net/forum?id=rJ4km2R5t7.
[10] A. Wang, Y. Pruksachatkun, N. Nangia, A. Singh, J. Michael, F. Hill, O. Levy, S. R. Bowman, SuperGLUE: A stickier benchmark for general-purpose language understanding systems, in: Advances in Neural Information Processing Systems 32 (NeurIPS 2019), Vancouver, BC, Canada, 2019, pp. 3261–3275. URL: https://proceedings.neurips.cc/paper/2019/hash/4496bf24afe7fab6f046bf4923da8de6-Abstract.html.
[11] J. Deng, W. Dong, R. Socher, L. Li, K. Li, L. Fei-Fei, ImageNet: A large-scale hierarchical image database, in: 2009 IEEE Conference on Computer Vision and Pattern Recognition (CVPR 2009), Miami, Florida, USA, IEEE, 2009, pp. 248–255. doi:10.1109/CVPR.2009.5206848.
[12] J. Renz, X. Ge, M. Stephenson, P. Zhang, AI meets Angry Birds, Nat. Mach. Intell. 1 (2019) 328. doi:10.1038/s42256-019-0072-x.
[13] S. A. Gaggl, T. Linsbichler, M. Maratea, S. Woltran, Design and results of the second international competition on computational models of argumentation, Artif. Intell. 279 (2020). doi:10.1016/j.artint.2019.103193.
[14] The RoboCup Standard Platform League, https://spl.robocup.org/, 1998.
[15] The RoboCup@Home League, https://athome.robocup.org/, 2006.
[16] V. Panayotov, G. Chen, D. Povey, S. Khudanpur, Librispeech: An ASR corpus based on public domain audio books, in: 2015 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP 2015), South Brisbane, Queensland, Australia, IEEE, 2015, pp. 5206–5210. doi:10.1109/ICASSP.2015.7178964.
[17] D. Perez-Liebana, S. M. Lucas, R. D. Gaina, J. Togelius, A. Khalifa, J. Liu, General video game artificial intelligence, Synthesis Lectures on Games and Computational Intelligence 3 (2019) 1–191. URL: https://gaigresearch.github.io/gvgaibook/.
[18] Y. Bisk, R. Zellers, R. LeBras, J. Gao, Y. Choi, PIQA: Reasoning about physical commonsense in natural language, in: The Thirty-Fourth AAAI Conference on Artificial Intelligence (AAAI 2020), New York, NY, USA, AAAI Press, 2020, pp. 7432–7439. URL: https://ojs.aaai.org/index.php/AAAI/article/view/6239.
[19] N. Froleyks, M. Heule, M. Iser, M. Järvisalo, M. Suda, SAT competition 2020, Artif. Intell. 301 (2021) 103572. doi:10.1016/j.artint.2021.103572.
[20] R. Zellers, Y. Bisk, A. Farhadi, Y. Choi, From recognition to cognition: Visual commonsense reasoning, in: IEEE Conference on Computer Vision and Pattern Recognition (CVPR 2019), Long Beach, CA, USA, 2019, pp. 6720–6731. doi:10.1109/CVPR.2019.00688.
[21] Assembly performance metrics and test methods, https://www.nist.gov/el/intelligent-systems-division-73500/robotic-grasping-and-manipulation-assembly/assembly, 2018.
[22] A. L. Maas, R. E. Daly, P. T. Pham, D. Huang, A. Y. Ng, C. Potts, Learning word vectors for sentiment analysis, in: The 49th Annual Meeting of the Association for Computational Linguistics: Human Language Technologies (ACL 2011), Portland, Oregon, USA, 2011, pp. 142–150. URL: https://aclanthology.org/P11-1015/.
[23] M. Sap, H. Rashkin, D. Chen, R. LeBras, Y. Choi, SocialIQA: Commonsense reasoning about social interactions, CoRR abs/1904.09728 (2019). arXiv:1904.09728.
[24] M. R. Genesereth, N. Love, B. Pell, General game playing: Overview of the AAAI competition, AI Mag. 26 (2005) 62–72. doi:10.1609/aimag.v26i2.1813.
[25] P. Rajpurkar, J. Zhang, K. Lopyrev, P. Liang, SQuAD: 100,000+ questions for machine comprehension of text, in: Proceedings of the 2016 Conference on Empirical Methods in Natural Language Processing (EMNLP 2016), Austin, Texas, USA, 2016, pp. 2383–2392. doi:10.18653/v1/d16-1264.
[26] Y. Yang, W. Yih, C. Meek, WikiQA: A challenge dataset for open-domain question answering, in: Proceedings of the 2015 Conference on Empirical Methods in Natural Language Processing (EMNLP 2015), Lisbon, Portugal, 2015, pp. 2013–2018. doi:10.18653/v1/d15-1237.
[27] R. Zellers, Y. Bisk, R. Schwartz, Y. Choi, SWAG: A large-scale adversarial dataset for grounded commonsense inference, in: Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing (EMNLP 2018), Brussels, Belgium, 2018, pp. 93–104. doi:10.18653/v1/d18-1009.
[28] A. Marot, B. Donnot, G. Dulac-Arnold, A. Kelly, A. O'Sullivan, J. Viebahn, M. Awad, I. Guyon, P. Panciatici, C. Romero, Learning to run a power network challenge: a retrospective analysis, in: NeurIPS 2020 Competition and Demonstration Track, volume 133 of Proceedings of Machine Learning Research, PMLR, 2020, pp. 112–132. URL: http://proceedings.mlr.press/v133/marot21a.html.
[29] L. Yang, SDKD: Saliency detection with knowledge distillation, https://lifelong-robotic-vision.github.io/competition/papers/PekingU_linyang.pdf, 2019.
[30] C.-C. Hsu, B. A. Sandford, The Delphi technique: making sense of consensus, Practical Assessment, Research, and Evaluation 12 (2007) 10.
[31] J. Devlin, M.-W. Chang, K. Lee, K. Toutanova, BERT: Pre-training of deep bidirectional transformers for language understanding, arXiv preprint arXiv:1810.04805 (2018).
[32] T. B. Brown, B. Mann, N. Ryder, M. Subbiah, J. Kaplan, P. Dhariwal, A. Neelakantan, P. Shyam, G. Sastry, A. Askell, et al., Language models are few-shot learners, in: Advances in Neural Information Processing Systems, volume 33, 2020, pp. 1877–1901.
[33] D. Hendrycks, S. Basart, S. Kadavath, M. Mazeika, A. Arora, E. Guo, C. Burns, S. Puranik, H. He, D. Song, J. Steinhardt, Measuring coding challenge competence with APPS, 2021. arXiv:2105.09938.
[34] R. Bommasani, et al., On the opportunities and risks of foundation models, arXiv preprint arXiv:2108.07258 (2021).
[35] F. Martinez-Plumed, P. Barredo, S. O. Heigeartaigh, J. Hernandez-Orallo, Research community dynamics behind popular AI benchmarks, Nature Machine Intelligence 3 (2021) 581–589.
[36] A. Barbosa-Silva, S. Ott, K. Blagec, J. Brauner, M. Samwald, Mapping global dynamics of benchmark creation and saturation in artificial intelligence, arXiv preprint arXiv:2203.04592 (2022).