<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta>
      <journal-title-group>
        <journal-title>Y. Moros-Daval);</journal-title>
      </journal-title-group>
    </journal-meta>
    <article-meta>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Anthony G Cohn</string-name>
          <email>a.g.cohn@leeds.ac.uk</email>
          <xref ref-type="aff" rid="aff0">0</xref>
          <xref ref-type="aff" rid="aff3">3</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>José Hernández-Orallo</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
          <xref ref-type="aff" rid="aff4">4</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Julius Sechang Mboli</string-name>
          <email>mboli4god@gmail.com</email>
          <xref ref-type="aff" rid="aff0">0</xref>
          <xref ref-type="aff" rid="aff1">1</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Yael Moros-Daval</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
          <xref ref-type="aff" rid="aff4">4</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Zhiliang Xiang</string-name>
          <email>xiangz6@cardiff.ac.uk</email>
          <xref ref-type="aff" rid="aff0">0</xref>
          <xref ref-type="aff" rid="aff2">2</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Lexin Zhou</string-name>
          <email>lzhou@inf.upv.es</email>
          <xref ref-type="aff" rid="aff0">0</xref>
          <xref ref-type="aff" rid="aff4">4</xref>
        </contrib>
        <aff id="aff0">
          <label>0</label>
          <institution>Evaluation Instruments, Comparison of Evaluation Instruments, Categorisation of Evaluation Instruments</institution>
          ,
          <addr-line>Artificial Intelli-</addr-line>
        </aff>
        <aff id="aff1">
          <label>1</label>
          <institution>Faculty of Engineering and Informatics, University of Bradford</institution>
          ,
          <country country="UK">UK</country>
        </aff>
        <aff id="aff2">
          <label>2</label>
          <institution>IROHMS, School of Computer Science and Informatics, Cardiff University</institution>
          ,
          <country country="UK">UK</country>
        </aff>
        <aff id="aff3">
          <label>3</label>
          <institution>School of Computing, University of Leeds, UK; and the Turing Institute</institution>
          ,
          <country country="UK">UK</country>
        </aff>
        <aff id="aff4">
          <label>4</label>
          <institution>VRAIN, Universitat Politècnica de València</institution>
          ,
          <country country="ES">Spain</country>
        </aff>
      </contrib-group>
      <pub-date>
        <year>2055</year>
      </pub-date>
      <volume>000</volume>
      <fpage>0</fpage>
      <lpage>0002</lpage>
      <abstract>
        <p>The current and future capabilities of Artificial Intelligence (AI) are typically assessed with an ever-increasing number of benchmarks, competitions, tests and evaluation standards, which are meant to work as AI evaluation instruments (EIs). These EIs are increasing not only in number but also in complexity and diversity, making it hard to understand this evaluation landscape in a meaningful way. In this paper we present an approach for categorising EIs using a set of 18 facets, accompanied by a rubric that allows anyone to apply the framework to any existing or new EI. We apply the rubric to 23 EIs in different domains through a team of raters, and analyse how consistent the rubric is and how well it works to distinguish between EIs and map the evaluation landscape in AI.</p>
      </abstract>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>1. Introduction</title>
      <p>Ever since researchers started building AI systems, they
have wanted to evaluate them, against human
benchmarks (such as playing human experts at chess or
other games) and/or against other AI systems. Finding
good benchmarks for evaluating systems, and conducting
tests, is harder than it might seem, particularly since we
believe we have good methods for evaluating human
intelligence, via standard tests and examinations.</p>
      <sec id="sec-1-3">
        <p>There have been many tests proposed for evaluating AI systems. Probably the most famous of these is the Turing Test [1]. There have been various Turing Test competitions, of which the best known is the annual Loebner Prize competition; the results have sometimes been entertaining, and a way of promulgating ideas about AI to the general public, but it is hard to argue that</p>
        <p>https://jsmboli.github.io/jsmboli/ (J. S. Mboli); https://zl-xiang.github.io/ (Z. Xiang); https://lexzhou.github.io/ (L. Zhou)</p>
        <p>0000-0002-7652-8907 (A. G. Cohn); 0000-0001-9746-7632 (J. Hernández-Orallo); 0000-0003-1708-3052 (J. S. Mboli);</p>
        <p>can be found in [5].</p>
        <p>The situation today is that there are thousands of
challenges in almost all areas of AI. They are increasing in
complexity and diversity as AI techniques themselves
evolve. Because of this, it is hard to analyse this
evaluation landscape in a meaningful way. Motivated by
this need, we present and discuss an approach to
categorising benchmarks, competitions, tests and evaluation
standards, jointly referred to as AI evaluation instruments
(EIs). We do this categorisation via a set of 18 facets, which
we believe will be valuable in distinguishing and
evaluating different proposals for evaluating AI systems. These
facets, and an accompanying rubric to facilitate choosing
appropriate values, are described in Section 2.</p>
        <p>We will classify EIs using the facets in order to evaluate (a) how well the facets work in general and (b) to what extent they help to map the landscape of EIs and distinguish their differences. This may help inform how much we can translate from the facet values to guide the design of future EIs. We do not imagine there can be a single universal evaluation instrument, or even a battery for each domain (vision, reasoning, etc.); certainly that ideal has eluded the community so far. We do not even aspire to find facet values that are valid for all EIs, but our proposed work may help in directing future efforts in the evaluation of AI systems.</p>
        <p>Since it is infeasible in a reasonable amount of time to apply this categorisation to the thousands of EIs in the literature, here we cover 23 EIs (see Table 2). By evaluating a reasonable number of carefully chosen examples, we hope to give a fair picture of the extent to which the aspects of AI appraised by the facets are being tested in the selected examples. Beyond the insights that we extract from this selected set of EIs, this paper and the rubric we have developed for the different facets should serve as a reference for third parties (e.g., other researchers) to analyse other EIs.</p>
        <p>The rest of the paper is organised as follows. Section 2 presents the 18 facets and a rubric which explains how facet values should be chosen. Next, in Section 3, we discuss the criteria for selecting the 23 EIs and the methodology the raters used to apply the rubric. Section 4 discusses the level of disagreement between raters for each facet and EI, and how the methodology and the number of raters were adapted based on these observations. Section 5 analyses the ratings of the 23 EIs, and what they reveal about this group of EIs. Finally, Section 6 closes with some general discussion and possible future work.</p>
        <sec id="sec-1-3-1">
          <title>2. Characterising AI Evaluation Instruments</title>
          <p>We looked for existing features or dimensions to characterise EIs, but unfortunately we did not find any systematic account in AI, other than concepts such as reproducibility, realism, coverage and specificity, usually referred to by other names and applied to a single EI. We found more dimensions and a more systematic coverage of evaluation instruments in the area of psychological testing. As a result, we introduce a new set of facets; where possible the terminology follows common use in AI, while also incorporating terms and concepts from the Standards for Educational and Psychological Testing by the American Educational Research Association [6].</p>
          <p>The following list<sup>1</sup> proposes 18 facets to characterise existing and future EIs for AI. Each facet is followed by the options in brackets. Some options indicate ‘(specify)’, which means that the rater must indicate a (free-text) value for that option. The full description of the facets usually includes some examples and further clarifications<sup>2</sup>. Here we only include the basic definition of each of them. We use colours (blue and black) that are indicative, with blue referring to the preferred or most challenging case, in general. However, for some facets a blue value may make no sense, or we do not believe that one value is ‘better’ than any other, so these facets have no coloured facet value(s).</p>
          <p>• Vp - Purpose [RESEARCH, CONFORMITY, OTHER (specify)]: Is the benchmark meant to foster research or development, or to certify whether an AI system conforms with some level or standard?</p>
          <p>• Vc - Capability [TASK-PERFORMANCE (specify), CAPABILITY (specify)]: Does the EI just measure observed (aggregated) performance on a TASK (e.g., protein folding, credit scoring) or is the EI designed to also measure a CAPABILITY (e.g., object permanence, dealing with negation)?</p>
          <p>• Vf - Reference [ABSOLUTE, RELATIVE (specify)]: Are results reported as an absolute metric (criterion-referenced) or are they reported as a relative (percentage) metric to a reference (norm-referenced), e.g., human performance?</p>
          <p>• Vo - Coverage [BIASED (specify), REPRESENTATIVE]: Does the EI cover a BIASED or unbiased (REPRESENTATIVE) distribution of what is meant to be measured?</p>
          <p>• Vs - Specificity [SPECIFIC, CONTAMINATED]: Are the results precisely aligned with what is meant to be measured or contaminated by other skills or tasks?</p>
          <p>• Vl - Realism [TOY, GAMIFIED, REALISTIC, REAL-LIFE]: Is the EI a toy problem, a complex gamified problem, a realistic setting (e.g., still in a simulated scenario, a lab or a testing facility), or is the evaluation itself happening in real life<sup>3</sup>?</p>
          <p>• Cj - Judgeability [MANUAL, AUTOMATED, MIXED]: Is scoring manual (e.g., through human questionnaires or judges) or automated (e.g., correct answers or optimality function) or a mixture?</p>
          <p>• Cc - Containedness [FULLY-CONTAINED, PARTIAL-INTERFERENCE (specify), NOT-CONTAINED (specify)]: Once started, is the testing isolated from external factors or interference possibly having an effect on the results (human participants, online data, weather, etc.), or is there some partial interference not affecting the results significantly, or is it dependent on external resources and conditions?</p>
          <p>• Cp - Reproducibility [NON-REPRODUCIBLE, STOCHASTIC, EXACT]: Is the evaluation non-reproducible, with results biased or spoiled if repeated; does the EI have stochastic components leading to different interactions; or are the results completely reproducible, i.e., can exactly the same test (inputs, interaction, etc.) be generated again for another (or the same) competitor?</p>
          <p>• Cl - Reliability [RELIABLE, NON-RELIABLE, N/A]: Does the evaluation present sufficient repetitions, episode length or number of instances to give low variance for the same subject when applied again (test-retest reliability)? If the testing methodology or the common use of the EI is not clear then N/A may be the most appropriate facet value.</p>
          <p>• Cv - Variation [FIXED, ALTERED, PROCEDURAL]: Is the evaluation based on fixed datasets; have the instances been altered by adding post-processing variations (noise, rotations, etc.); or have they been created (e.g., using procedural generation<sup>4</sup>)?</p>
          <p>• Ca - Adjustability [UNSTRUCTURED, ABLATABLE, ADAPTIVE]: Is the analysis of results on the set of instances unstructured; or has the EI identified a set of meta-features such as difficulty or dimension that could be used to analyse the results by these dimensions (ablatable); or are these meta-features used to adaptively or adversarially choose the instances to test more informatively (adaptive)?</p>
          <p>• Fn - Antecedents [CREATED, RETROFITTED (specify)]: Is it devised on purpose for AI or adapted from tests designed to test humans?</p>
          <p>• Fm - Ambition [SHORT, LONG]: When the EI was created, was it aiming at the short term (improving on the SOTA) or long term (more ambitious goals)?</p>
          <p>• Fp - Partiality [PARTIAL (specify), IMPARTIAL]: Does the EI favour particular technologies, conditions or cultures that should not have an influence on the result of the evaluation<sup>5</sup>?</p>
          <p>• Fo - Objectivity [LOOSE, CUSTOMISED, FULLY-INDEPENDENT]: Is it loosely defined, customised to each participant or does the EI have a predetermined independent specification<sup>6</sup>?</p>
          <p>• Fr - Progression [STATIC, DEVELOPMENTAL]: Is the score measuring a capability at one particular moment or is it evaluating the development of the capability of the system within the test?</p>
          <p>• Fu - Autonomy [AUTONOMOUS, COUPLED (specify), COMPONENT]: Is it measuring an autonomous system, coupled with other systems (e.g., humans) or as a component?</p>
          <p>The facets above can be grouped into three main categories following the three main groups given by the Standards for Educational and Psychological Testing [6]: validity, reliability/precision and fairness. We use these three major groups to give some structure to the facets above. Roughly, these groups deal with what is measured, how it is measured and who is measured, respectively.</p>
          <p>• Validity group (Does it measure what we want to measure?): Vp, Vc, Vf, Vo, Vs, Vl</p>
          <p>• Consistency (Reliability/Precision) group (Does it measure it effectively and verifiably?): Cj, Cc, Cp, Cl, Cv, Ca</p>
          <p>• Fairness group (Does it treat all test takers equally?): Fn, Fm, Fp, Fo, Fr, Fu</p>
          <p>Some of these are closely related, such as {Cv, Ca, Vo} or {Fo, Cp}. The term accommodation in [6] is “used to denote changes with which the comparability of scores is retained, and the term modification is used to denote changes that affect the construct measured by the test”. This is related to Vs, Cv, Fo and Cc, and also to the term “measurement invariance”, which is very important here to see if accommodations of the same test could evaluate the same construct for different AI systems and even humans.</p>
          <p><sup>1</sup>Each facet has both a name and a two-letter acronym, whose initial letter is V, C or F, the reason for which will become clear later.</p>
          <p><sup>2</sup>The latest version of the rubric can be found at https://tinyurl.com/mr2bv5hb</p>
          <p><sup>3</sup>REAL-LIFE does not mean a final or specific product in operation. It can also happen in very early stages of research, such as evaluating prototype chatbots in a real social network.</p>
        </sec>
      </sec>
      <sec id="sec-1-4">
        <title>3. EI Selection and Rating Methodology</title>
        <p>Now that the facets and the rubric have been explained, we proceed to discuss how the EIs were selected, what the final selection was, and what protocol we followed in assigning EIs to the raters.</p>
      </sec>
      <sec id="sec-1-5">
        <title>3.1. EI Selection</title>
        <p>We considered evaluation instruments with the following criteria for inclusion:</p>
        <p>• Potential interest to understand the future of AI skills:</p>
        <p>An EI might be regarded as being of interest if systems that perform well on it can be regarded as indicating a noteworthy change in the capabilities of AI in general. In other words, progress in this EI requires significant enhancement of AI techniques beyond the specific requirements of the EI.</p>
        <p>• Diversity in the kind of task: We tried to cover a variety of domains, formats and types of problems (vision, natural language, competitions, datasets, supervised, etc.).</p>
        <p>• Popularity: How many teams have already used this EI? How many published papers refer to it? We can use proxies for this, such as citations to the original papers introducing the EI, or the number of results on websites such as paperswithcode.com. We also have to consider that industry-related EIs may be less popular than research-oriented EIs. However, given the number of EIs selected, we repeat domains and cover just a few areas (e.g., NLP, vision, robotics) without being comprehensive for all possible domains.</p>
        <p>• Currency: We prefer EIs still in active use or recently introduced, rather than those which have fallen out of use.</p>
        <p><sup>4</sup>Although we have coloured PROCEDURAL, we recognise that procedural may not always be better and can lead to problems if variations are not in an appropriate proportion. Also, generated data may just lead to a learning algorithm reverse-engineering the generator.</p>
        <p><sup>5</sup>Vo-Coverage is about the domain, whilst Fp-Partiality is about how the EI may favour some test-takers over others.</p>
        <p><sup>6</sup>LOOSE refers to cases when evaluation is very open, e.g., a robotic-domain EI where we evaluate on a satisfactory interaction with the user, but not even a clear questionnaire is defined. FULLY-INDEPENDENT could treat different groups differently if there is a reason for equality of treatment.</p>
        <table-wrap>
          <label>Table 1</label>
          <table>
            <thead>
              <tr><th/><th>Consistently Agreed</th><th>Moderately Agreed</th><th>Often Diverged</th></tr>
            </thead>
            <tbody>
              <tr><td>2 options</td><td>Fr, Fn</td><td>Vf, Fp</td><td>Vc, Vo, Vs, Fm</td></tr>
              <tr><td>3 options</td><td>Vp, Cj, Cc, Fo, Fu</td><td>Cp, Cv</td><td>Cl, Ca</td></tr>
              <tr><td>4 options</td><td/><td>Vl</td><td/></tr>
            </tbody>
          </table>
        </table-wrap>
        <p>The sources of the EIs were mostly repositories<sup>7</sup> and
surveys, institutions such as NIST<sup>8</sup> and LNE<sup>9</sup>, and
competitions at AI conferences. We then identified possible
gaps in terms of domains, and checked whether we expected the
answers for some facets to be too similar. We
also considered whether we would expect to get diversity
in the values in blue for the facets, so that we get different
levels of quality according to this colour code. Note that
at the time of selection we could of course only roughly
estimate how many blue categories we might get for each
EI. Since we expected to learn more about the
categorisation of EIs as it proceeded, we did not choose
all EIs in advance but selected them incrementally. The
23 selected EIs are shown in Table 2.</p>
        <p>These EIs cover a good distribution of benchmarks,
competitions and datasets, although some of them can
be considered to be in two of these categories. The term
‘test’ to refer to an EI is less usual. About half of the
23 EIs require the use of language in the inputs and/or
outputs, and about one half of them require some kind of
perception (mostly computer vision), with some overlap
in these two groups. Only a few of the EIs are related to
navigation and robotics, in virtual (e.g., video games) or
physical environments, and a small number are related to
more abstract capabilities or problems related to planning
or optimisation.</p>
        <p><sup>7</sup>http://paperswithcode.com, http://kaggle.com, https://zenodo.org/record/4647824#.YV7CPdrMKUk, https://www.eff.org/ai/metrics, https://en.wikipedia.org/wiki/List_of_datasets_for_machine-learning_research, http://www.chalearn.org.</p>
        <p><sup>8</sup>https://www.nist.gov/programs-projects/ai-measurement-and-evaluation</p>
        <p><sup>9</sup>https://www.lne.fr/en/testing/evaluation-artificial-intelligence-systems</p>
        <p>3.2. Rating Methodology</p>
        <p>We devised a protocol to refine and validate the rubric,
but also to cover as many EIs as possible, according to
the number of raters we had available. We explain the
protocol below, but we note that this protocol can be
adapted to other situations or can incorporate ideas from
consensus-based ratings or the Delphi method [30]. First,
two of the authors of this paper (A.C. and J.H-O.) acted
as coordinators for the rating process. A total of four
raters were chosen. The raters were undergraduate and
graduate students in AI-related fields, and were recruited through a
selection process and interviews. They are the other four
authors of this paper (J.S.M., Y.M-D., Z.X. and L.Z.). Once
the raters were appointed, each rater was given some
meta-information about each EI (acronym, name,
major sources, what it measures, etc.) and had to complete
some other general information about each EI. They were
also asked to record information about the rating process itself,
such as the time taken (in hours).</p>
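        <p>As an illustration of what a rater's record for one EI might look like, the following minimal sketch encodes the 18 facets and their options and checks a record against them. The option lists are transcribed from the facet list in Section 2; the names FACETS, NEEDS_TEXT and check_rating, the record layout, and the spelling PARTIAL-INTERFERENCE are assumptions made for this example and are not part of the rubric.</p>
        <preformat>
```python
# Facet code: (facet name, allowed options), transcribed from Section 2.
FACETS = {
    "Vp": ("Purpose", ["RESEARCH", "CONFORMITY", "OTHER"]),
    "Vc": ("Capability", ["TASK-PERFORMANCE", "CAPABILITY"]),
    "Vf": ("Reference", ["ABSOLUTE", "RELATIVE"]),
    "Vo": ("Coverage", ["BIASED", "REPRESENTATIVE"]),
    "Vs": ("Specificity", ["SPECIFIC", "CONTAMINATED"]),
    "Vl": ("Realism", ["TOY", "GAMIFIED", "REALISTIC", "REAL-LIFE"]),
    "Cj": ("Judgeability", ["MANUAL", "AUTOMATED", "MIXED"]),
    "Cc": ("Containedness",
           ["FULLY-CONTAINED", "PARTIAL-INTERFERENCE", "NOT-CONTAINED"]),
    "Cp": ("Reproducibility", ["NON-REPRODUCIBLE", "STOCHASTIC", "EXACT"]),
    "Cl": ("Reliability", ["RELIABLE", "NON-RELIABLE", "N/A"]),
    "Cv": ("Variation", ["FIXED", "ALTERED", "PROCEDURAL"]),
    "Ca": ("Adjustability", ["UNSTRUCTURED", "ABLATABLE", "ADAPTIVE"]),
    "Fn": ("Antecedents", ["CREATED", "RETROFITTED"]),
    "Fm": ("Ambition", ["SHORT", "LONG"]),
    "Fp": ("Partiality", ["PARTIAL", "IMPARTIAL"]),
    "Fo": ("Objectivity", ["LOOSE", "CUSTOMISED", "FULLY-INDEPENDENT"]),
    "Fr": ("Progression", ["STATIC", "DEVELOPMENTAL"]),
    "Fu": ("Autonomy", ["AUTONOMOUS", "COUPLED", "COMPONENT"]),
}

# Options marked "(specify)" in the rubric require a free-text value.
NEEDS_TEXT = {
    ("Vp", "OTHER"), ("Vc", "TASK-PERFORMANCE"), ("Vc", "CAPABILITY"),
    ("Vf", "RELATIVE"), ("Vo", "BIASED"), ("Cc", "PARTIAL-INTERFERENCE"),
    ("Cc", "NOT-CONTAINED"), ("Fn", "RETROFITTED"), ("Fp", "PARTIAL"),
    ("Fu", "COUPLED"),
}


def check_rating(rating):
    """Return a list of problems found in one rater's record for one EI.

    `rating` maps a facet code to a pair (option, free_text).
    """
    problems = []
    for code, (_name, options) in FACETS.items():
        if code not in rating:
            problems.append(code + ": missing")
            continue
        option, text = rating[code]
        if option not in options:
            problems.append(code + ": unknown option " + option)
        elif (code, option) in NEEDS_TEXT and not text:
            problems.append(code + ": option requires a free-text value")
    return problems
```
        </preformat>
        <p>A record that names a valid option for every facet, with free text wherever the rubric asks the rater to specify, yields an empty list of problems.</p>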
        <p>We established three batches, covering 2, 11 and 10
EIs respectively, in the order they are presented in Table
2. The first two EIs had already been used by the
coordinators in developing the list of facets and their values.
All the subsequent raters started off on these two EIs too
and were given feedback on their chosen values before
proceeding to any further EIs. We refer to these two EIs
as “Batch 1”. The next 11 EIs are referred to as “Batch
2”. These two batches were done by all four raters,
independently. After analysing consistency at the end of
batches 1 and 2, we found reasonable inter-rater consistency
and deemed it sufficient to have only two raters per EI. A final
set of 10 EIs, referred to as “Batch 3”, were therefore each rated by
just two raters, for reasons of economy.
The two raters for each EI were assigned
so that all raters would have five EIs and, across their five
EIs, would co-rate with all the other three raters (i.e., one
EI with one other rater and two EIs with each of the other
two raters). In this first stage, they worked independently, not
sharing values for any of the facets, and only reporting
questions and partial results to the coordinators.</p>
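        <p>The Batch 3 constraints above (ten EIs, two raters per EI, five EIs per rater, and every rater sharing one EI with one colleague and two EIs with each of the other two) can be met by a simple construction, sketched below. This is only one possible way to satisfy the constraints; the actual assignment procedure used is not described here, and the function names and rater labels are hypothetical.</p>
        <preformat>
```python
from collections import Counter
from itertools import combinations


def assign_pairs(raters, eis):
    """Assign each EI to a pair of raters (assumes 4 raters and 10 EIs).

    Two disjoint pairs share a single EI; the four remaining pairs
    share two EIs each, which gives 2 * 1 + 4 * 2 = 10 EIs in total.
    """
    if len(raters) != 4:
        raise ValueError("this scheme is written for exactly 4 raters")
    single = {frozenset(raters[:2]), frozenset(raters[2:])}
    slots = []
    for pair in combinations(raters, 2):
        copies = 1 if frozenset(pair) in single else 2
        slots.extend([pair] * copies)
    if len(slots) != len(eis):
        raise ValueError("this scheme needs exactly 10 EIs")
    return dict(zip(eis, slots))


def co_rating_counts(plan, rater):
    """Count how many EIs `rater` shares with each other rater."""
    return Counter(
        other
        for pair in plan.values() if rater in pair
        for other in pair if other != rater
    )
```
        </preformat>
        <p>For any rater, co_rating_counts returns one colleague with a count of 1 and two colleagues with a count of 2, so each rater handles five EIs.</p>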
        <p>There were some changes to the rubric between
batches, especially to clarify the description of some
of the facets, and in a few cases, changes in the number</p>
        <p>LibrispeechSL12 [16]
GVGAI [17]
PIQA [18]
SAT [19]
VCR [20]
Assembly [21]
IMDb [22]
SocialIQA [23]
GGP [24]
SQUAD2.0 [25]
WikiQA [26]</p>
        <p>Domain
LU, CS, reasoning</p>
        <p>Aim
It was specifically targeted to evaluate common sense reasoning, as an
alternative to the Turing test, arguing conceptual and practical advantages
The original goal was to evaluate “general, domain-independent AI
technology”, by using a diversity of video games, although what it measures more
specifically is unclear.</p>
        <p>The goal of GLUE and superGLUE (an improvement/modified version of
GLUE) is to measure the performance (e.g. accuracy, F1-score) of an AI system
in natural language understanding tasks (Single-Sentence Tasks, Similarity
and Paraphrase Tasks, and Inference Tasks) in English.</p>
        <p>The goal of GLUE and superGLUE (an improvement/modified version of
GLUE) is to measure the performance (e.g. accuracy, F1-score) of an AI system
in natural language understanding tasks (Single-Sentence Tasks, Similarity
and Paraphrase Tasks, and Inference Tasks) in English.</p>
        <p>Aims to measure the visual recognition capability for object recognition,
image classification, and object localisation. The images can contain different
numbers of objects (e.g. mammal, bird, fish, vehicle, furniture, tool, flower,
fruit, etc.), occlusions, and clutter (i.e. diversity and noise).</p>
        <p>Measures the planning capability of an agent in a large action space, without
knowing of the physical parameters of objects, situation given by Angry Birds.</p>
        <p>Aims to measure/compare the performance of different solvers regarding
argumentation (particularly, reasoning problems that require logic).</p>
        <p>The aim is to measure &amp; promote improvements in multi-robot (humanoid)
systems by playing soccer matches with robots
aims to measure the performance of the developed AI robots in providing
service with assistive robot technology with high relevance for future personal
domestic applications.</p>
        <p>Aims to provide freely available read speech corpus in English that is suitable
for training and testing speech recognition systems.</p>
        <p>Aimed to systems that can perform well in multiple video games, possibly
without knowing the game in advance and with little to no specific domain
knowledge, as an approximation to artificial general intelligence
Aims to measure physical interaction reasoning about both the prototypical
use of objects (e.g., shoes are used for walking) and non-prototypical but
practically plausible use of objects (e.g., shoes can be used as a doorstop).</p>
        <p>It targets language representations of knowledge traditionally only seen or
experienced.</p>
        <p>Aims to keep progress &amp; further improve the performance &amp; robustness of SAT
solvers, with a history dating back to the early 90s, thanks to the persistent
efforts of the SAT community.</p>
        <p>It aims to measure the ability to infer what is happening in a picture (people’s
actions, goals, etc.) from visual signs which are obvious for humans.</p>
        <p>Identifying key competencies and characteristics of robotic systems using a
robust set of formalized evaluations and benchmarks. To help to match robotic
hand capabilities to end-user needs as well as to help provide developers and
researchers insight for improving their hardware and software designs
Detecting the sentiment of a piece of text
Aimed to measure the social and emotional intelligence of computational
models through multiple choice question answering
General game playing (GGP) is the design of artificial intelligence programs
to be able to play more than one game successfully.</p>
        <p>It aims to measure reading comprehension abilities that allows a system to
get a correct answer to a given question when the solution can be extracted
from the text or abstain from answering otherwise
WIKIQA is a dataset for opendomain question answering
Year
2016
2013
2018
2019
2010
2010
2015
1998
2006
2015
2014
2019
2002
2019
2017
2011
2019
2005
2018
2014
competition</p>
        <p>VG;general AI; PN
competition
competition
competition
competition
dataset
benchmark
dataset
competition
dataset
competition
dataset
benchmark
competition
dataset
benchmark
dataset
dataset,
benchmark
SWAG [27]. Domain: NLI, CR. Year: 2018. Aim: Aims to evaluate the performance of a system in grounded commonsense inference (reasoning about a situation and anticipating what might come next) by answering multiple-choice questions.
L2RPN [28] (competition). Domain: SG, AI, PG, PN. Year: 2012. Aim: This challenge aims at testing the potential of AI to address this important real-world problem for our future.</p>
        <p>Lifelong- competition robotics, CV, RV Provides a robotic vision dataset collected from real time environments to 2019
Robots [29] accelerate both research and applications of visual models for robotics.
Abbreviations: HRIC = Human-Robot-Interaction and Cooperation; NMDE = Navigation and Mapping in dynamic environments; CV = Computer Vision, ABP = Adaptive Behaviors, planning; AA = abstract argumentation; CL = computational logic; VG = video
games; KRRP = knowledge representation; reasoning; planning; RCRPVMASS robotics; cooperation; real-time planning; vision; multiagent systems; strategy; LU = Language understanding; CS = common sense; RM = Robotics in Manufacturing; ARH = Adaptive
Robot hands, MPLT = Manipulation planning based on learning techniques;DiHM = Dexterous in-hand manipulation; RGVELO = Robust grasping with various everyday life objects; SI = social interaction, SIn = social intelligence, EI = emotional intelligence, IR =
inferential reasoning; CR = commonsense reasoning;VR = visual recognition; PN = planning and navigation, SG = Smart Grids, PG = Power Grids, PN = Power networks, PCU = physical commonsense understanding, NLI = natural language inference, RV = Robotic
vision
and/or name of the options. Whenever a change was The pattern of agreement or disagreement amongst
introduced, the raters were informed and had to revisit the raters tend to vary depending on several factors such
their ratings for previous batches. as facet complexity, available information on the EI, and</p>
        <p>In a second and final stage of the process, the coor- so on. In particular, we observe the following:
dinators allowed the raters to exchange opinions, but • Fr, Fn, Vp, Cj, Cc, Fo, Fu are consistently agreed across
they were not asked to reach a consensus, just to identify all batches, with very few disagreements.
possible misunderstandings. From this discussion, a few • Vf, Vl, Cp, Cv, Fp appear to be moderately agreed and
ratings were modified. Unless explicitly stated, we refer supported by a majority (≥ 75%). Notably, Vl has the
to these final ratings in the rest of the paper. largest number of value options, but still agreed well
by a majority.</p>
        <p>• While selections on Vo, Vs and Cl with binary options,
4. Analysis of Rater Consistency are two of the least agreed ones.</p>
        <p>It is not surprising that some of the facets consistently
As noted above, the 1st and 2nd batches difer from batch reached consensus considering the facet values tend to
3 because the former had four raters whilst the latter only distribute towards one single selection (detailed in
Sectwo. Thus, in the former case, a majority agreement can tion 5). For instance, as we will see in the following
be formed with three or four raters agreeing, whilst in section, RESEARCH is picked for the Vp facet with only
batch 3 only when both raters agree; hence ‘majority’ is one disagreement for all rounds. This might reflect the
less statistically significant for the 3rd batch. For simplicity, we will use round A and round B when referring to the first two batches and the 3rd batch, respectively. As shown in Figure 1, the level of agreement coincides to a great extent when comparing the results from all batches (Figure 1, top) with the individual ones from round A (Figure 1, middle) and round B (Figure 1, bottom). It can be expected that facets with more possible values (4) might have more disagreements than those with only two possible values, simply for statistical reasons. In fact, this does not have a big effect, as shown in Table 1.</p>
        <p>Figure 1: Agreements on facet value ratings for the 23 EIs and rounds A and B. [Three bar-chart panels: "Facets Agreements for Both Rounds" (top), "Facets Agreements for the Round A" (middle) and "Facets Agreements for Round B" (bottom). Each panel plots counts per facet (Vp, Vc, Vf, Vo, Vs, Vl, Cj, Cc, Cp, Cl, Cv, Ca, Fn, Fm, Fp, Fo, Fr, Fu); legend: agreement, majority, disagreement.]</p>
        <p>[...] fact that some EIs do not have much variability in their options. For example, most EIs are indeed proposed for the purpose of research (Vp), and given the low variability in the values there cannot be much disagreement (the variance of a Bernoulli distribution). As the variability of facets increases, choosing answers for the facets might require more EI-specific domain knowledge from the raters. For instance, to make justifiable decisions for facets like Vo and Vs, raters often needed to consult related literature when the answers were not clear from the specifications of the EIs. Whether an EI is specific (Vs) and general (Vo) enough for measuring certain capabilities is indeed hard to judge depending solely on [...] explanation for inconsistent selections in Vc, Ca and Fm, since they allow raters more space for subjective interpretation [...] regarding NLU (long-term); object recognition could be argued to be a visual capability or a specific task. Having both option variability and subjectivity made the three [...] is more likely to be SPECIFIC. As such, Vs is more likely to diverge if disagreement occurred on Vc. This might also account for the high diverging rate of facets in the Validity group.</p>
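        <p>The statistical remark above, that a facet whose values barely vary leaves little room for disagreement, can be illustrated with a small calculation. The sketch below assumes that two raters choose independently according to the base rates of the facet values; this independence model and the function names are illustrative assumptions, not a model taken from the study.</p>
        <preformat>
```python
from typing import Sequence


def bernoulli_variance(p: float) -> float:
    """Variance p * (1 - p) of a Bernoulli variable with parameter p."""
    return p * (1.0 - p)


def pairwise_disagreement(probs: Sequence[float]) -> float:
    """Probability that two independent raters choose different values.

    `probs` lists the assumed probability of each facet value (must sum to 1).
    """
    if abs(sum(probs) - 1.0) > 1e-9:
        raise ValueError("probabilities must sum to 1")
    # Raters agree when both pick the same value: sum of squared probabilities.
    return 1.0 - sum(q * q for q in probs)


if __name__ == "__main__":
    # Binary facet dominated by one value (assumed 95% of EIs share it).
    print(pairwise_disagreement([0.95, 0.05]))
    # Four-valued facet whose values are used evenly.
    print(pairwise_disagreement([0.25, 0.25, 0.25, 0.25]))
```
        </preformat>
        <p>For a binary facet this quantity is exactly twice the Bernoulli variance, so under these assumptions a facet where 95% of EIs share one value yields a disagreement probability of about 0.095, against 0.75 for four evenly used values.</p>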
        <p>In summary, apart from the statistical reason given by the number of values and their variability, the causes for disagreement can be grouped into three blocks:</p>
        <p>• Similarity between facet values: The closeness or similarity between facet options might also have reduced the chance of picking the right option. For example, the facet Vl (Realism) has four options (TOY, GAMIFIED, REALISTIC and REAL-LIFE), and it is not always easy to distinguish between REALISTIC and REAL-LIFE.</p>
        <p>• Insufficient details: For many EIs, the information or details provided by the organisers of the competition, the test or the datasets in the EI are not sufficient to understand what the EI is actually measuring. Other EIs are well documented and have published articles that make it easy to obtain meta-information and the facet values for such EIs.</p>
        <p>• Conflicting information: One of the factors that did not help is the source of information about each EI. For some EIs, there is perhaps too much information and many papers using them, and they do not always understand the same thing or use it in the same way. One paper or website might be talking about task performance while other sources talk of capabilities or both.</p>
        <p>Overall, given these sources and level of disagreement, as shown in Figure 1, we considered the rubric sufficiently validated to move from round A to round B with fewer raters, and for the analysis in the next section.</p>
        <p>5. Analysis of Results</p>
        <p>[...] Surprisingly, only around half of the EIs were SPECIFIC (Vs), i.e., the other half were CONTAMINATED. All the EIs that were designed for TASK-PERFORMANCE are always SPECIFIC (this is suggested in the rubric) but, more interestingly, most EIs designed to measure CAPABILITY are CONTAMINATED (i.e., the results do not completely align with what is meant to be measured). More effort is needed to encourage reliable and robust methodologies to evaluate the capability of AI systems, although we recognise that it is sometimes inevitably hard to measure certain capabilities reliably (e.g., common-sense reasoning). With regard to realism (Vl), REALISTIC EIs account for a predominant proportion (circa 80%), implying considerable focus on measuring systems solving practical problems, but the evaluation is not in an actual real-life scenario; thus most EIs focus on evaluating the systems in simulated scenarios or scenarios which are an abstraction of a real-world setting.</p>
        <p>Consistency group (Does it measure it effectively and verifiably?): Nearly all EIs are FULLY-CONTAINED (Cc), implying that current EIs enjoy high independence from external factors during the assessment, and RELIABLE (Cl), which are desirable features. Regarding Cj, most EIs evaluate the systems with AUTOMATED scoring instead of MANUAL or MIXED. This can be double-edged, since automated scoring is generally more objective and faster to calculate but also requires a proper definition of the scoring. For instance, how do we use automated scoring to evaluate whether a robotic dancer or cook is good or bad? This may be easy for some human experts but quite hard to define using a metric. Things become particularly complicated when measuring a special capability, such as common-sense reasoning.</p>
        <p>In terms of Cv, nearly all EIs are FIXED datasets. Almost none altered the instances by adding post-processing variations, or created new instances to cover a range of variations intrinsically, possibly because using fixed datasets is easier than modifying instances systematically. However, this could limit the diversity of the evaluation methodology (e.g., sometimes it would be interesting to see how the system’s performance varies when noise is added to the data, to test the model’s robustness).</p>
        <sec id="sec-1-5-1">
          <p>Surprisingly, most EIs are UNSTRUCTURED or ABLATABLE (Ca), but almost none are ADAPTIVE. This might be because adaptive tests are much more difficult to operate and require an understanding of what the most informative instances are.</p>
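          <p>The idea mentioned for the variation facet (Cv), adding noise to a fixed dataset to see how performance changes, can be sketched as follows. Everything here is a toy assumption for illustration: the one-dimensional dataset, the `threshold_model` standing in for a system under test, and the choice of Gaussian noise are hypothetical and do not correspond to any EI analysed in this paper.</p>
          <preformat>
```python
import random
from typing import Callable, List, Sequence, Tuple

Instance = Tuple[float, int]  # (feature, label): toy stand-in for an EI instance


def accuracy(model: Callable[[float], int], data: Sequence[Instance]) -> float:
    """Fraction of instances that the model labels correctly."""
    return sum(1 for x, y in data if model(x) == y) / len(data)


def perturb(data: Sequence[Instance], sigma: float, rng: random.Random) -> List[Instance]:
    """Post-processing variation: add Gaussian noise to features, keep labels."""
    return [(x + rng.gauss(0.0, sigma), y) for x, y in data]


def robustness_curve(
    model: Callable[[float], int],
    data: Sequence[Instance],
    sigmas: Sequence[float],
    seed: int = 0,
) -> List[Tuple[float, float]]:
    """Accuracy at each noise level; the seed is fixed so runs are repeatable."""
    rng = random.Random(seed)
    return [(s, accuracy(model, perturb(data, s, rng))) for s in sigmas]


def threshold_model(x: float) -> int:
    """Hypothetical system under test: predicts 1 for positive inputs."""
    return 1 if x > 0 else 0


# Hypothetical fixed dataset, separable at zero.
FIXED_DATASET: List[Instance] = [(-2.0, 0), (-1.0, 0), (-0.5, 0), (0.5, 1), (1.0, 1), (2.0, 1)]

if __name__ == "__main__":
    for sigma, acc in robustness_curve(threshold_model, FIXED_DATASET, [0.0, 0.5, 1.0, 2.0]):
        print(f"sigma={sigma:.1f} accuracy={acc:.2f}")
```
          </preformat>
          <p>With no noise the toy model is perfect on this dataset; reporting the whole curve rather than a single score is one way an EI could move beyond a FIXED dataset.</p>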
          <p>Fairness group (Does it treat all test takers equally?): EIs that are IMPARTIAL account for 80% of the data (Fp), which seems a good indicator. However, the actual value might be even lower, since partiality is often hard to detect. For instance, in an EI for benchmarking clinical decision support systems, the training set may only include Latin American patients while the test set contains patients from other regions.</p>
        </sec>
        <sec id="sec-1-5-2">
          <p>Interestingly, virtually all the analysed EIs are classified as FULLY-INDEPENDENT (Fo), as the values CUSTOMISED and LOOSE are only 0.25 (i.e., these options were only chosen once). The fact that current EIs have the same predetermined specification for all assessed systems is positive and a characteristic that favours fairness in evaluation. Nearly all EIs evaluate the AI systems statically rather than developmentally, possibly because for many applications we care more about the final performance than about how the system’s performance evolves. Also, it is easier to evaluate the former than the latter. However, DEVELOPMENTAL EIs could give more insights about how the models are learning with variations of the input features and different curricula, detect when and why things go wrong during the training phase, and expose the trade-off between number of instances, time and performance.</p>
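          <p>The clinical example given for the partiality facet (Fp), where the training set covers one region and the test set others, suggests a simple first check that an EI creator could run. The sketch below is only an illustration under stated assumptions: the function names and the group labels are hypothetical, and comparing group membership between splits is just one of many possible indicators of partiality.</p>
          <preformat>
```python
from collections import Counter
from typing import Dict, Hashable, Iterable, List


def group_shares(groups: Iterable[Hashable]) -> Dict[Hashable, float]:
    """Share of instances belonging to each group label."""
    counts = Counter(groups)
    total = sum(counts.values())
    return {g: n / total for g, n in counts.items()}


def unseen_groups(train_groups: Iterable[Hashable], test_groups: Iterable[Hashable]) -> List[Hashable]:
    """Groups that occur in the test split but never in the training split."""
    seen = set(train_groups)
    # Sort by string form so the result is stable across runs.
    return sorted({g for g in test_groups if g not in seen}, key=str)


if __name__ == "__main__":
    # Hypothetical region labels attached to each patient record.
    train = ["latin_america"] * 8
    test = ["latin_america", "europe", "asia", "europe"]
    print(group_shares(test))
    print(unseen_groups(train, test))
```
          </preformat>
          <p>A non-empty result from `unseen_groups` does not prove that an EI is PARTIAL, but it flags a mismatch that raters may otherwise only discover after repeated use of the EI.</p>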
          <p>In summary, regarding the validity of the EIs, we found that most of the selected EIs that measure a capability do not necessarily measure the capability reliably. Still, these failures could serve as excellent future references for developing more robust frameworks for evaluating capabilities, and more efforts are required in the years to come. Also, we still need to improve the coverage (i.e., representativeness) of the current EIs. In addition, the development of more EIs with real-life settings may encourage the development of AI systems better able to operate in real-life situations.</p>
          <p>Figure 2: The distribution of the options in all facets.</p>
          <p>Regarding the consistency group: although most of the selected EIs measure effectively and verifiably, as they are FULLY-CONTAINED and RELIABLE, there is still an evident lack of diversity in the evaluation process. For instance, we may need more EIs focusing on altering instances by adding post-processing variations or creating instances to cover a range of variations intrinsically. Also, more adaptive ways to test a system should be encouraged, in order to evaluate how the system copes in circumstances with different difficulties. Finally, in terms of fairness, the selected EIs enjoy low partiality and high objectivity. However, more efforts are needed in spurring EIs to also focus on evaluating how a system performs during the development process. Furthermore, the community may need more benchmarks that focus on humans and machines working together, since only one out of 23 EIs was done this way.</p>
          <p>When looking at the distribution of facet values per EI, we can see that those related to robotics and the physical world (Robocup SPL, Robocup@Home and lifelong-robots) have more variability in judgeability (MANUAL becomes more frequent), realism (REALISTIC and REAL-LIFE also become more frequent) and containedness (PARTIAL-INTERFERENCE becoming more common), as well as autonomy, with the COUPLED value being chosen in some of them. One of the most popular EIs in the history of AI, ImageNet, is the only one where the value PARTIAL is chosen by (at least) half of the raters, and also the one with all BIASED values chosen in coverage (along with LibriSpeech). The disagreement in partiality may suggest that some sources of partiality are only discovered after the repeated use of an EI and are not identified by everyone immediately. GVGAI is peculiar as a well-thought-out EI, where video games are ablatable by several characteristics or by the difficulty of the game. This also goes in the direction of being procedural, but still to a limited extent as per the values assigned by the raters for this EI. Finally, those EIs related to natural language, and especially WSC, GLUE, SUPERGLUE, Physical IQa, SocialQA, SQUAD2.0, WikiQA and SWAG, have high degrees of CONTAMINATED values in facet SPECIFICITY. This might be a reflection of how difficult it is to isolate particular capabilities when using natural language, as some basic natural language competency requires many other things. This is reflected by the success of language models recently doing a variety of tasks [31, 32, 33, 34], since mastering natural language seems to be contaminated by so many other capabilities and skills.</p>
        </sec>
      </sec>
    </sec>
    <sec id="sec-2">
      <title>6. Discussion and Conclusions</title>
      <p>In section 4 we have seen disagreement between CAPABILITY and PERFORMANCE (Vc), between SPECIFIC and CONTAMINATED (Vs), and between UNSTRUCTURED and ABLATABLE (Ca). The distributions of these facets in section 5 may illustrate a difficulty in interpreting what the EI designers intended, i.e., a lack of clarity in the specification of the EI. It may also be a sign of unresolved issues in AI evaluation: going from task-oriented evaluation based on performance to more general EIs leads to SPECIFICITY problems. For instance, adding many millions of examples can help coverage but comes with problems of specificity and more difficulty in understanding the role each example plays in the overall score being measured by the EI.</p>
      <p>Being aware of the consistency issues of the rating methodology, we think the set of facets and associated rubric, as well as the results of the study of 23 EIs reported in this paper, can be useful for three different kinds of users in slightly different ways. First, EI creators can see what design choices in their EI to modify from a first evaluation of its facets and see how it compares to other EIs. Second, AI system developers can choose the right EIs according to the facet values, and better understand what they can expect from the evaluation and what it means exactly. Finally, for policy-makers and stakeholders from academia, scientific publishing, industry, government and other strategic organisations, an increasing number of EIs being evaluated and catalogued can serve to understand the landscape of AI evaluation much better. This can help them recognise gaps and limitations, beyond the unstructured collections of benchmark results by metric that have become very useful for meta-analysis but still lack structure and insight about the EIs themselves.</p>
      <p>In fact, there have been several studies focusing on numeric comparison and the evolution of performance for a range of EIs [35, 36]. These studies see the evolution of the progress of AI systems according to some metrics, but we need more analysis on how the evaluation instruments (benchmarks, competitions, standards, tests, etc.) are also evolving, and whether they are meeting the demands of a more comprehensive evaluation beyond some simple metrics. This was our main motivation.</p>
      <p>We have faced some difficulties in determining the criteria for inclusion of EIs, the isolation of some facets that were difficult to understand or confused with others, and finding a protocol of application that is sufficiently robust but at the same time requires a limited number of raters and other resources. We plan this setting to be a live endeavour, with some facets being added, changed or removed in new versions of the rubric. However, some stability in names, facet values and facet description is needed to be able to compile the results of different rating studies over time, increasing from the 23 EIs evaluated here to the order of hundreds in the future, with a more diverse and numerous pool of raters. As an immediate continuation of this work ourselves, we plan to apply the rubric to further EIs. We hope these facets and the rubric describing them can help track the evolution of AI evaluation in the years to come, and identify the facets where changes are happening or should happen.</p>
      <p>Acknowledgments</p>
      <p>We thank the anonymous reviewers for their comments. The development of this rubric was performed in the context of the OECD AI and Future of Skills project. Several versions of the facets were discussed in a series of meetings within the project, and especially two meetings on July 5th 2021 and October 26th, where we presented preliminary versions of this rubric. In particular, we thank the OECD team (Stuart Elliott, Abel Baret, Margarita Kalamova, Nóra Révai, Mila Staneva) and the rest of the experts and participants (Guillaume Avrin, Lucy Cheke, Kenneth D. Forbus, Yvette Graham, Patrick Kyllonen, Elena Messina, Britta Rüschoff, Michael Schönstein, Jim Spohrer and Swen Ribeiro). We also thank the OECD for the funding which made this work possible as well as their encouragement.</p>
      <p>References</p>
      <p>[1] A. Turing, Computing machinery and intelligence, Mind 59 (1950) 433.</p>
      <p>[2] S. M. Shieber, Principles for designing an AI competition, or why the Turing Test fails as an inducement prize, AI Magazine 37 (2016) 91–96.</p>
      <p>[3] S. M. Shieber, Lessons from a restricted Turing Test, Commun. ACM 37 (1994) 70–78.</p>
      <p>[4] P. Hayes, K. Ford, Turing test considered harmful, in: International Joint Conference on Artificial Intelligence (IJCAI), 1995, pp. 972–977.</p>
      <p>[5] A. G. Cohn, On evaluating artificial intelligence systems: Competitions and benchmarks, in: AI and the Future of Skills, Volume 1 Capabilities and Assessments, OECD, 2021, pp. 238–251.</p>
      <p>[6] AERA, APA, NCME, et al., Standards for educational and psychological testing, American Educational Research Association, 2014.</p>
      <p>[7] H. J. Levesque, The Winograd Schema Challenge, in: Logical Formalizations of Commonsense Reasoning, Papers from the 2011 AAAI Spring Symposium, Technical Report SS-11-06, Stanford, California, USA, March 21-23, 2011, AAAI, 2011. URL: http://www.aaai.org/ocs/index.php/SSS/SSS11/paper/view/2502.</p>
      <p>[8] M. G. Bellemare, Y. Naddaf, J. Veness, M. Bowling, The arcade learning environment: An evaluation
platform for general agents, J. Artif. Intell. Res. 47 (2013) 253–279. URL: https://doi.org/10.1613/jair.3912. doi:10.1613/jair.3912.</p>
      <p>[9] A. Wang, A. Singh, J. Michael, F. Hill, O. Levy, S. R. Bowman, GLUE: A multi-task benchmark and analysis platform for natural language understanding, in: 7th International Conference on Learning Representations, ICLR 2019, New Orleans, LA, USA, May 6-9, 2019, OpenReview.net, 2019. URL: https://openreview.net/forum?id=rJ4km2R5t7.</p>
      <p>[10] A. Wang, Y. Pruksachatkun, N. Nangia, A. Singh, J. Michael, F. Hill, O. Levy, S. R. Bowman, Superglue: A stickier benchmark for general-purpose language understanding systems, in: H. M. Wallach, H. Larochelle, A. Beygelzimer, F. d’Alché-Buc, E. B. Fox, R. Garnett (Eds.), Advances in Neural Information Processing Systems 32: Annual Conference on Neural Information Processing Systems 2019, NeurIPS 2019, December 8-14, 2019, Vancouver, BC, Canada, 2019, pp. 3261–3275. URL: https://proceedings.neurips.cc/paper/2019/hash/4496bf24afe7fab6f046bf4923da8de6-Abstract.html.</p>
      <p>[11] J. Deng, W. Dong, R. Socher, L. Li, K. Li, L. Fei-Fei, Imagenet: A large-scale hierarchical image database, in: 2009 IEEE Computer Society Conference on Computer Vision and Pattern Recognition (CVPR 2009), 20-25 June 2009, Miami, Florida, USA, IEEE Computer Society, 2009, pp. 248–255. URL: https://doi.org/10.1109/CVPR.2009.5206848. doi:10.1109/CVPR.2009.5206848.</p>
      <p>[12] J. Renz, X. Ge, M. Stephenson, P. Zhang, AI meets angry birds, Nat. Mach. Intell. 1 (2019) 328. URL: https://doi.org/10.1038/s42256-019-0072-x. doi:10.1038/s42256-019-0072-x.</p>
      <p>[13] S. A. Gaggl, T. Linsbichler, M. Maratea, S. Woltran, Design and results of the second international competition on computational models of argumentation, Artif. Intell. 279 (2020). URL: https://doi.org/10.1016/j.artint.2019.103193. doi:10.1016/j.artint.2019.103193.</p>
      <p>[14] The robocup standard platform league, https://spl.robocup.org/, 1998.</p>
      <p>[15] The robocup@home league, https://athome.robocup.org/, 2006.</p>
      <p>[16] V. Panayotov, G. Chen, D. Povey, S. Khudanpur, Librispeech: An ASR corpus based on public domain audio books, in: 2015 IEEE International Conference on Acoustics, Speech and Signal Processing, ICASSP 2015, South Brisbane, Queensland, Australia, April 19-24, 2015, IEEE, 2015, pp. 5206–5210. URL: https://doi.org/10.1109/ICASSP.2015.7178964. doi:10.1109/ICASSP.2015.7178964.</p>
      <p>[17] D. Perez-Liebana, S. M. Lucas, R. D. Gaina, J. Togelius, A. Khalifa, J. Liu, General video game artificial intelligence, Synthesis Lectures on Games and Computational Intelligence 3 (2019) 1–191. https://gaigresearch.github.io/gvgaibook/.</p>
      <p>[18] Y. Bisk, R. Zellers, R. LeBras, J. Gao, Y. Choi, PIQA: reasoning about physical commonsense in natural language, in: The Thirty-Fourth AAAI Conference on Artificial Intelligence, AAAI 2020, The Thirty-Second Innovative Applications of Artificial Intelligence Conference, IAAI 2020, The Tenth AAAI Symposium on Educational Advances in Artificial Intelligence, EAAI 2020, New York, NY, USA, February 7-12, 2020, AAAI Press, 2020, pp. 7432–7439. URL: https://ojs.aaai.org/index.php/AAAI/article/view/6239.</p>
      <p>[19] N. Froleyks, M. Heule, M. Iser, M. Järvisalo, M. Suda, SAT competition 2020, Artif. Intell. 301 (2021) 103572. URL: https://doi.org/10.1016/j.artint.2021.103572. doi:10.1016/j.artint.2021.103572.</p>
      <p>[20] R. Zellers, Y. Bisk, A. Farhadi, Y. Choi, From recognition to cognition: Visual commonsense reasoning, in: IEEE Conference on Computer Vision and Pattern Recognition, CVPR 2019, Long Beach, CA, USA, June 16-20, 2019, Computer Vision Foundation / IEEE, 2019, pp. 6720–6731. URL: http://openaccess.thecvf.com/content_CVPR_2019/html/Zellers_From_Recognition_to_Cognition_Visual_Commonsense_Reasoning_CVPR_2019_paper.html. doi:10.1109/CVPR.2019.00688.</p>
      <p>[21] Assembly performance metrics and test methods, https://www.nist.gov/el/intelligent-systems-division-73500/robotic-grasping-and-manipulation-assembly/assembly, 2018.</p>
      <p>[22] A. L. Maas, R. E. Daly, P. T. Pham, D. Huang, A. Y. Ng, C. Potts, Learning word vectors for sentiment analysis, in: D. Lin, Y. Matsumoto, R. Mihalcea (Eds.), The 49th Annual Meeting of the Association for Computational Linguistics: Human Language Technologies, Proceedings of the Conference, 19-24 June, 2011, Portland, Oregon, USA, The Association for Computer Linguistics, 2011, pp. 142–150. URL: https://aclanthology.org/P11-1015/.</p>
      <p>[23] M. Sap, H. Rashkin, D. Chen, R. LeBras, Y. Choi, Socialiqa: Commonsense reasoning about social interactions, CoRR abs/1904.09728 (2019). URL: http://arxiv.org/abs/1904.09728. arXiv:1904.09728.</p>
      <p>[24] M. R. Genesereth, N. Love, B. Pell, General game playing: Overview of the AAAI competition, AI Mag. 26 (2005) 62–72. URL: https://doi.org/10.1609/aimag.v26i2.1813. doi:10.1609/aimag.v26i2.1813.</p>
      <p>[25] P. Rajpurkar, J. Zhang, K. Lopyrev, P. Liang, Squad: 100,000+ questions for machine comprehension of text, in: J. Su, X. Carreras, K. Duh (Eds.), Proceedings of the 2016 Conference on Empirical Methods in Natural Language Processing, EMNLP
2016, Austin, Texas, USA, November 1-4, 2016, The Association for Computational Linguistics, 2016, pp. 2383–2392. URL: https://doi.org/10.18653/v1/d16-1264. doi:10.18653/v1/d16-1264.</p>
      <p>[26] Y. Yang, W. Yih, C. Meek, WikiQA: A challenge dataset for open-domain question answering, in: L. Màrquez, C. Callison-Burch, J. Su, D. Pighin, Y. Marton (Eds.), Proceedings of the 2015 Conference on Empirical Methods in Natural Language Processing, EMNLP 2015, Lisbon, Portugal, September 17-21, 2015, The Association for Computational Linguistics, 2015, pp. 2013–2018. URL: https://doi.org/10.18653/v1/d15-1237. doi:10.18653/v1/d15-1237.</p>
      <p>[27] R. Zellers, Y. Bisk, R. Schwartz, Y. Choi, SWAG: A large-scale adversarial dataset for grounded commonsense inference, in: E. Riloff, D. Chiang, J. Hockenmaier, J. Tsujii (Eds.), Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing, Brussels, Belgium, October 31 - November 4, 2018, Association for Computational Linguistics, 2018, pp. 93–104. URL: https://doi.org/10.18653/v1/d18-1009. doi:10.18653/v1/d18-1009.</p>
      <p>[28] A. Marot, B. Donnot, G. Dulac-Arnold, A. Kelly, A. O’Sullivan, J. Viebahn, M. Awad, I. Guyon, P. Panciatici, C. Romero, Learning to run a power network challenge: a retrospective analysis, in: H. J. Escalante, K. Hofmann (Eds.), NeurIPS 2020 Competition and Demonstration Track, 6-12 December 2020, Virtual Event / Vancouver, BC, Canada, volume 133 of Proceedings of Machine Learning Research, PMLR, 2020, pp. 112–132. URL: http://proceedings.mlr.press/v133/marot21a.html.</p>
      <p>[29] L. Yang, Sdkd: Saliency detection with knowledge distillation, https://lifelong-robotic-vision.github.io/competition/papers/PekingU_linyang.pdf, 2019.</p>
      <p>[30] C.-C. Hsu, B. A. Sandford, The Delphi technique: making sense of consensus, Practical assessment, research, and evaluation 12 (2007) 10.</p>
      <p>[31] J. Devlin, M.-W. Chang, K. Lee, K. Toutanova, Bert: Pre-training of deep bidirectional transformers for language understanding, arXiv preprint arXiv:1810.04805 (2018).</p>
      <p>[32] T. B. Brown, B. Mann, N. Ryder, M. Subbiah, J. Kaplan, P. Dhariwal, A. Neelakantan, P. Shyam, G. Sastry, A. Askell, et al., Language models are few-shot learners, in: Advances in Neural Information Processing Systems, volume 33, 2020, pp. 1877–1901.</p>
      <p>[33] D. Hendrycks, S. Basart, S. Kadavath, M. Mazeika, A. Arora, E. Guo, C. Burns, S. Puranik, H. He, D. Song, J. Steinhardt, Measuring coding challenge competence with APPS, 2021. arXiv:2105.09938.</p>
      <p>[34] R. Bommasani, et al., On the opportunities and risks of foundation models, arXiv preprint arXiv:2108.07258, 2021.</p>
      <p>[35] F. Martinez-Plumed, P. Barredo, S. O. Heigeartaigh, J. Hernandez-Orallo, Research community dynamics behind popular AI benchmarks, Nature Machine Intelligence 3 (2021) 581–589.</p>
      <p>[36] A. Barbosa-Silva, S. Ott, K. Blagec, J. Brauner, M. Samwald, Mapping global dynamics of benchmark creation and saturation in artificial intelligence, arXiv preprint arXiv:2203.04592 (2022).</p>
    </sec>
  </body>
  <back>
    <ref-list />
  </back>
</article>