<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta />
    <article-meta>
      <title-group>
        <article-title>Strengthening the AI Operating Environment</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Bruce Hedin</string-name>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Samuel Curtis</string-name>
        </contrib>
        <aff>Hedin B Consulting</aff>
        <aff>The Future Society</aff>
      </contrib-group>
      <abstract>
        <p>In the rapidly evolving discourse on artificial intelligence (AI), the familiar refrain of “maximizing potential while mitigating risks” has become something of a ubiquitous mantra, emphasizing the need for an effective risk mitigation framework. This paper briefly examines the current state of AI-enabled applications and discusses the various risk containment strategies being implemented. Initial efforts focused on establishing high-level principles for responsible AI use. More recent strategies have sought to operationalize these principles through normative instruments, such as industry best practices and legal statutes, that govern AI applications and their creators. While valuable, such a top-down approach is not sufficiently effective; a complementary, bottom-up approach focused on strengthening the environment in which AI is deployed is also necessary. The paper analyzes two specific initiatives aimed at enhancing the human component of AI deployment (creating a better-informed public through AI benchmarks, creating a better-equipped public with resources for local validation) and offers insights on how this environment-focused track can contribute to risk containment. Furthermore, we suggest additional steps for leveraging this approach in tandem with top-down strategies to cultivate a more robust risk mitigation framework.</p>
      </abstract>
      <kwd-group>
        <kwd>AI governance</kwd>
        <kwd>AI education</kwd>
        <kwd>AI risk</kwd>
        <kwd>Benchmarking</kwd>
        <kwd>Evaluation</kwd>
        <kwd>Validation</kwd>
        <kwd>Effectiveness</kwd>
        <kwd>Competence</kwd>
        <kwd>Trust</kwd>
      </kwd-group>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>1. Introduction</title>
      <p>As the use of AI-enabled applications, both in the legal domain and elsewhere, has gone from a topic for academic discussion to a matter of everyday practice, questions about how best to realize the potential of such applications, and how best to mitigate the risks attendant upon their use, have taken front and center in the various venues in which the interaction between technology and the norms and institutions that govern the life of society is discussed. This attention to AI’s potential for both good and bad, and to ways of realizing the former while containing the latter, has only been heightened in recent months by the release of a range of publicly accessible applications that draw on large language models (such as GPT-4).</p>
      <p>An attention to the risks attendant on the use of AI, provided it is grounded in an understanding of AI’s real capabilities and limitations, is salutary. It is true that the risks, given the current state of the technology, are sometimes overstated (LLMs are indeed robust platforms for a range of different applications and can generate output that closely approximates that which a human might create; they are, nevertheless, still simply statistical models of discourse tokens, well short of the capacity for understanding and creativity characteristic of general intelligence [1][2]). It is also true, however, that even narrow-purpose AI applications (e.g., within the legal domain, those designed specifically for judicial decision modeling, predictive policing, or facial recognition) can, if used improperly, jeopardize core social values such as fairness, subtract from individual privacy, liberty, and dignity, and undermine assumptions about truth-seeking and justice-realization that are the basis for the rule of law (and hence for a stable democratic order). These are, regardless of one’s perspective on the capacity and implications of LLMs, serious risks that call for commensurate efforts at risk containment.</p>
      <p>Efforts at containing the risks attendant upon AI have been under way for some time. Early efforts focused on articulating high-level, value-oriented principles for the responsible design, development, and use of AI (for examples, see [3][4][5][6][7][8]). Collectively, these efforts were, if the sheer quantity of principles (or sets of principles) proposed is a measure of success, quite successful [9][10][11]. Where these efforts fell short was in establishing mechanisms connecting the principles to actual practice.</p>
      <p>More recent efforts, seeking to fill this gap, have focused on the question of how to operationalize such principles. The objective of these efforts has generally been the creation of normative instruments that would encourage, or enforce, adherence to the aspirational principles.</p>
      <p>The forms proposed for such normative instruments have varied, from informal industry best practices, to more precise (and auditable) standards, all the way to enforceable legal statutes. The object of governance for these normative instruments has been primarily AI applications and their creators: the norms established are intended to act as “guardrails” on the design and operation of AI-enabled applications, on the objectives and requirements of developers of applications, on the use cases in which applications may be deployed, and even on the structure and conduct of the entities that produce AI-enabled applications. Notable examples of initiatives on this top-down track include the creation of government offices charged with responsibility for algorithm inspection, proposals for regulations requiring that AI applications meet certain design specifications (“privacy by design,” “human rights by design”), laws requiring the destruction of training data, restrictions or outright bans on the use of certain applications (judicial modeling technologies, facial recognition technologies), and calls for a global moratorium on the research and development of “strong” AI.</p>
      <p>This top-down, application-focused, approach to risk
containment is, in at least some of its less heavy-handed
instantiations, a valuable and necessary one. It does not,
however, exhaust the approaches to risk containment
available to policymakers and other stakeholders in the
safe use of AI. Complementary to the application-focused
approach is an approach that starts from a bottom-up
perspective and takes as its objective, not the creation of
guardrails on the development and use of AI, but rather
the strengthening (or “hardening”) of the environment
in which AI-enabled applications are deployed. This
approach seeks to contain risk by making the
environment (in all its components: hardware, software, and
human) in which AI is deployed more resistant to AI
misuse (whether intentional or not) and therefore less
susceptible to the risks attendant on such misuse.</p>
      <p>In this paper, we examine more closely the potential that the bottom-up, environment-focused, track holds as a means for risk containment. We do so by considering approaches to strengthening the human component of the environment in which AI-enabled applications are deployed. More specifically, we draw attention to two key gaps in the resources currently available to stakeholders in the responsible use of AI in the service of the law: (1) the absence (discussed in Section 3) of an on-going program of benchmarks that can provide stakeholders with meaningful information on the actual capabilities and limitations of AI-enabled legal applications and (2) the absence (discussed in Section 4) of resources that would allow practitioners to conduct their own evaluations of the effectiveness of AI in real-world settings.</p>
      <p>In the case of each gap, we characterize the nature of the need, identify the features of a solution that would meet the need, and discuss work done to date toward such a solution. With the perspective gained from this discussion, we draw (in Section 5) some general lessons about the potential the environment-focused track holds for risk containment.</p>
    </sec>
    <sec id="sec-rw">
      <title>2. Related work</title>
      <p>This paper offers a framework (simply put: top-down vs. bottom-up) for assessing approaches to mitigating the risks attendant on the use of AI-enabled applications in the service of the law. There are, of course, other frameworks that have been offered, and these also can provide insightful perspectives. Among the initiatives that are related to, and often complementary to, the work presented in this paper are the following.</p>
      <p>Guidelines. The Asilomar AI Principles [7] put forward 23 principles, spanning research issues, ethics and values, and longer-term issues, for the research and development of AI. The European Ethical Charter on the Use of AI in Judicial Systems and their Environment [5], adopted by the European Commission for the Efficiency of Justice (CEPEJ), presents five principles intended for both public and private stakeholders responsible for the design and deployment of AI tools and services that involve the processing of judicial decisions and data. The General Principles of Ethically Aligned Design [4] proposes eight principles by which the ethical and values-based design, development, and implementation of autonomous and intelligent systems (including artificial intelligence and intelligent assistance technologies designed for legal professionals; see the Chapter on Law) should be guided. The European Commission’s Ethics Guidelines for Trustworthy AI [6], drafted by the European Commission High-Level Expert Group on AI, puts forward seven key requirements that AI systems should meet in order to be deemed trustworthy. The Partnership on AI has drafted eight tenets [8] that its members, spanning industry, academia, and non-profits, “endeavor to uphold.”</p>
      <p>Risk mitigation frameworks focused specifically on LLMs. Weidinger et al. [12] proposes a comprehensive taxonomy of ethical and social risks associated with large-scale language models, identifying twenty-one risks across six risk areas and discussing approaches to risk mitigation. Bai et al. [13] proposes a method called Constitutional AI (CAI) to train a non-evasive and relatively harmless AI assistant without human feedback labels for harms, with the aim of developing techniques to create AI systems that adhere to design (or “constitutional”) principles, as opposed to learning from human feedback. Mökander et al. [14] proposes a three-layered approach to auditing large language models, which includes governance audits, model audits, and application audits, the third of which includes a component to check LLMs’ adherence to ethical principles.</p>
      <p>Educational efforts. Long and Magerko [15] provides a concrete definition of AI literacy based on existing research and synthesizes a variety of interdisciplinary literature into a set of core competencies of AI literacy, as well as design considerations to support AI developers and educators in creating learner-centered AI. Lin and Van Brummelen [16] presents the findings from workshops co-designed with K-12 teachers (that scaffolding in AI tools and curricula is needed for ethical and data discussions, learner evaluation, engagement, peer collaboration, and critical reflection) and an exemplar lesson plan illustrating ways to teach AI in non-computing subjects within a remote setting. Gašević et al. [17] explores the theme of empowering learners for the age of AI and highlights the need for foundational discussions about learning theory and conceptualizations of learning actions and behaviors in AI-human settings, as well as concerns regarding ethics, bias, and fairness in AI’s growing influence. Hugging Face [18] has sought to democratize machine learning knowledge and competence by offering educational materials for beginners as well as instructors. Hugging Face supported the BigScience open research collaboration, which brought together more than 1,000 researchers from 60 countries and more than 250 institutions to create BLOOM [19], an openly and transparently trained multilingual LLM.</p>
    </sec>
    <sec id="sec-info">
      <title>3. Strengthening the AI operating environment through better information</title>
      <p>An environment in which those who would use, or be affected by, AI-enabled applications lack at least baseline information about AI (what it is, where it is, how it works, and how well it works) is an environment conducive to misuse (not to mention to unhelpful, even harmful, hype). Conversely, an environment in which both active and passive users of AI are well-informed about AI’s use cases, its conditions of use, and its strengths and weaknesses is one that will be more resistant to AI misuse (and to the risks associated with such misuse). An important component of any effective strategy for containing the risks associated with AI will therefore be education: if we can foster a public that is better informed about AI, we will foster a public better equipped to recognize and guard against the risks associated with it.</p>
      <p>The role of education in risk containment has been recognized for some time. Education was one of the three themes of the inaugural edition (2019) of The Athens Roundtable on Artificial Intelligence and the Rule of Law [20]. Education is also the focus of a number of current initiatives. The Future Society, to cite one example, has developed a MOOC on AI and the Rule of Law [21]; aiEDU, to cite another, is an initiative that promotes broad AI literacy through the development of AI curricula for use in a wide range of educational venues, from K-12 schools to public museums [22].</p>
      <p>As potentially valuable as these educational initiatives are, they will be successful in meeting their objectives only insofar as they are able to access and convey accurate and meaningful content. This is where a challenge appears: for some topics, namely topics related to the effectiveness of AI, the content is lacking (or at least lacking in the form required for fostering broadly distributed AI competence). In this section, we examine this gap and consider an approach to filling it.</p>
      <sec id="sec-info-need">
        <title>3.1. The need</title>
        <p>If we wish to foster an informed, and empowered,¹ public, one capable of making empirically well-grounded decisions about the sorts of tasks to which AI should and should not be applied, and the conditions that should be met when it is applied, we need to ensure that the public has access to accurate information about the effectiveness of AI (i.e., its capabilities and limitations when applied to real-world tasks). The problem is that evidence of the effectiveness of AI-enabled applications is spotty: it exists, and is accessible, in only a very incomplete and inconsistent manner. The reason is that there is no suitably authoritative institutionalized program for generating the required evidence in a manner and format that can be readily consumed by individuals and civil society groups, thereby meeting the objective of giving citizens informed agency over the use of AI in their (and their fellow citizens’) lives.</p>
        <p>¹ Informed, of course, does not necessarily mean empowered. Advancing the empowerment of citizens means not only ensuring that citizens have access to information but also ensuring that the legal and practical conditions are such that citizens can act on that information.</p>
        <p>It is also worth noting that, while our focus in this section is on the means to foster an informed public, the evidence gap just observed has wider implications. It acts as a roadblock not only to meeting the objective of an informed public but also to meeting the other objectives of the principles that have been articulated for the responsible use of AI (which may be stated,² at an abstract level, as (1) protection of core values, (2) creation of the conditions needed for an informed trust, and (3) advancement of technological innovation and economic prosperity).</p>
        <p>² This threefold classification scheme for the objectives of principles is the authors’ own; other classification schemes can be found in [9] and [10].</p>
        <p>Without sound evidence of effectiveness, we will be unable to protect core values, because we won’t know (a) whether the AI-enabled systems achieve their immediate goals nor (b) whether, even in achieving those immediate goals, they impinge on other core values. We will be unable to create the conditions of trust, because we will lack the empirical data that is the basis for a well-grounded trust (or distrust) [23]. We will be unable to advance the goals of technological innovation and economic prosperity, because we will lack the information needed to optimize the allocation of research effort and financing. In terms of approach, moreover, having access to sound evidence of effectiveness is necessary for both bottom-up and top-down approaches to risk containment. With regard to the latter, as illustrated in Figure 1, evidence of effectiveness is necessary both for the formulation of viable normative instruments and for the assessment of adherence to the norms instantiated in such instruments. In short, evidence of effectiveness is needed both for the general objective of ensuring the responsible use of AI-enabled systems and for the specific objective of fostering an informed public.</p>
      <sec id="sec-1-1">
        <title>An environment in which those who would use, or be</title>
        <p>afected by, AI-enabled applications lack at least baseline
information about AI (what it is, where it is, how it works,
and how well it works) is an environment conducive to
misuse (not to mention to unhelpful, even harmful, hype).</p>
        <p>Conversely, an environment in which both active and
passive users of AI are well-informed about AI’s use cases,
its conditions of use, and its strengths and weaknesses
is one that will be more resistant to AI misuse (and to
the risks associated with such misuse). An important
component of any efective strategy for containing the
risks associated with AI will therefore be education: if
we can foster a public that is better informed about AI,
we will foster a public better equipped to recognize and
guard against the risks associated with it.</p>
        <p>The role of education in risk containment has been rec- 1Informed, of course, does not necessarily mean empowered.
Adognized for some time. Education was one of the three vancing the empowerment of citizens means not only ensuring
themes of the inaugural edition (2019) of The Athens ltehgaatlcaitnidzepnrsahctaivcaelaccocnedssititoonisnfaorremsuatcihonthbauttcaitliszoenesnscuanrinagcttohnattthhaet
Roundtable on Artificial Intelligence and the Rule of information.</p>
        <p>Law[20]. Education is also the focus of a number of 2This threefold classification scheme for the objectives of principles
current initiatives. The Future Society, to cite one ex- is the authors’ own; other classification schemes can be found in [ 9]
and [10]
• Narrow focus. The research objectives of the
evaluations are often such that they are better
served by narrowly circumscribing the scope of
the exercise, not measuring the impact of the
whole sociotechnical system, of which the
technology is a part, on the values with which the
public may be concerned. Consistent with these
objectives, the evaluations gauge the performance
of the technologies being evaluated using metrics
specifically relevant to the capability addressed in
the study; they do not seek measures that would
provide a comprehensive view of the
technology’s fitness for purpose.
• Distance from real-world circumstances. In
the interest of arriving at a well-controlled
answer to specific research questions, the studies
often do not make allowance for variability in
all the factors that could, in a real-world setting,
afect a system’s efectiveness. The result is an
exercise that is removed from the real-world
circumstances. Moreover, obtaining evaluation data
sets that, in both size and character, are reflective
of the data populations to which the technology
under evaluation would be applied in a real-world
setting is a challenge that current evaluations are
often unable to meet.
goals, they impinge on other core values. We will be
unable to create the conditions of trust, because we
will lack the empirical data that is the basis for a
wellgrounded trust (or distrust) [23]. We will be unable to
advance the goals of technological innovation and
economic prosperity, because we will lack the
information needed to optimize the allocation of research efort
and financing. In terms of approach, moreover, having
access to sound evidence of efectiveness is necessary
for both bottom-up and top-down approaches to risk • Misalignment of purpose. Many currently
containment. With regard to the latter, as illustrated in available evaluations are of the one-of variety:
Figure 1, evidence of efectiveness is necessary both for they are designed to produce just the data needed
the formulation of viable normative instruments and for for the study that occasioned them and they are
the assessment of adherence to the norms instantiated not intended to be repeated on a regular basis. An
in such instruments. In short, evidence of efectiveness additional limitation that is particularly
characis needed both for the general objective of ensuring the teristic of industry white papers is that they are
responsible use of AI-enabled systems and for the specific generally designed, not to provide a well-rounded
objective of fostering an informed public. view of the technology’s fitness for purpose, but</p>
        <p>Now, to say that the required evidence is lack- to highlight characteristics of the enterprise’s
ofing is not to say that there is no evidence at all. fering that, the enterprise believes, will resonate
There is indeed a healthy flow of reports, of var- in the marketplace.
ious types, of evaluations of the efectiveness of
AI-enabled systems. These include: academic re- 3.2. A proposal for meeting the need
search papers[24][25][26]; industry white papers;
reports of government-sponsored evaluations[27][28]; If the objective of an informed (and empowered) public is
evaluations conducted by non-governmental civic a worthy one, and if a lack of evidence of the efectiveness
organizations[29][30][31]; and academic and industry- of AI-enabled systems is impeding the achievement of
sponsored benchmarking initiatives[32][33]. that goal, then what might a solution that removed that</p>
        <p>The problem is that these evaluations, while well- impediment look like? What we propose, and what we
designed to meet their own objectives, have not been discuss in the remainder of this section, is the creation
designed specifically to meet the objective of fostering of an on-going institutionalized program of
interoperaa general public that is informed and empowered. As ble open AI benchmarks, the purpose of which would
a result, the evidence the studies generate is lacking in be to supply the empirical evidence needed to foster a
key features required to meet that objective. Among key public empowered to make informed decisions about the
limitations of current evaluations3 are the following. use of AI-enabled technologies. The benchmarks should
be “open” in the sense that exercises must be
transpar</p>
      </sec>
      <sec id="sec-1-2">
        <title>3We are, of course, not of saying that all currently available evalua</title>
        <p>tions are subject to all of these limitations. We are saying simply
that each evaluation is subject to at least one of them.
ent: data used, procedures followed, and results
generated must all be open to inspection (or, in some cases,
audit), by both participants and independent observers.
They should be “interoperable” in the sense that they
will supply evidence usable by all regulatory regimes,
regardless of the specific goals and priorities that are
operative within any specific jurisdiction. Furthermore,
if they are to serve their intended purpose of fostering a
better-informed public, they should generate results that
can be understood by both experts and non-experts.</p>
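        <p>As an illustration of what “open” and “interoperable” might mean at the level of published results, the following sketch shows a minimal, machine-readable result record that could accompany each run of an exercise. It is our own illustration, and every field name in it is a hypothetical assumption rather than an element of any existing benchmark’s schema.</p>
        <preformat>
# Minimal sketch of an interoperable benchmark result record.
# Illustrative only: the field names are our own assumptions,
# not an established schema.
import json
from dataclasses import dataclass, asdict

@dataclass
class BenchmarkResult:
    exercise: str       # which evaluation exercise produced the result
    run_date: str       # ISO 8601 date of the run
    system: str         # named (or anonymized) system under test
    task: str           # real-world task the exercise models
    metric: str         # metric name, defined in the exercise protocol
    value: float        # point estimate
    ci_low: float       # lower bound of the 95% confidence interval
    ci_high: float      # upper bound of the 95% confidence interval
    data_uri: str       # pointer to the inspectable evaluation data
    procedure_uri: str  # pointer to the published procedure followed

record = BenchmarkResult(
    exercise="legal-retrieval-2023", run_date="2023-06-19",
    system="system-A", task="responsive-document-retrieval",
    metric="recall", value=0.82, ci_low=0.78, ci_high=0.86,
    data_uri="https://example.org/data",
    procedure_uri="https://example.org/protocol",
)
print(json.dumps(asdict(record), indent=2))
        </preformat>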
        <sec id="sec-1-2-1">
          <title>3.2.1. Requirements</title>
          <p>
            A benchmarking program that will meet the general
objective of fostering an informed public (a public that
includes everyone from researchers and designers, to
policymakers and lawyers, all the way to the potentially
involuntary decision subjects of judicial or enforcement
technologies) will have to meet certain requirements. It
must: (
            <xref ref-type="bibr" rid="ref1">1</xref>
            ) design evaluations that model real-world
circumstances; (
            <xref ref-type="bibr" rid="ref2">2</xref>
            ) generate results that will be meaningful,
and actionable, for a wide range of stakeholders; (
            <xref ref-type="bibr" rid="ref3">3</xref>
            ) run
evaluation exercises that are consistent and trusted; and
(
            <xref ref-type="bibr" rid="ref4">4</xref>
            ) be practically viable. Specific implications of these
basic requirements are the following.
enough to allow informative comparison from
one run of an exercise to the next), (
            <xref ref-type="bibr" rid="ref3">3</xref>
            ) that the
program should be institutionalized (i.e., have the
legitimacy and durability that come from
sponsorship by recognized public authorities), and (
            <xref ref-type="bibr" rid="ref4">4</xref>
            )
that the design and execution of the evaluations
run in the program be transparent (data used,
procedures followed, and results generated must all
be open to inspection by both participants and
independent observers).
• Practical. In order to be viable, the program
must also meet a number of non-trivial practical
requirements. These include (
            <xref ref-type="bibr" rid="ref1">1</xref>
            ) reaching
consensus on metrics for concepts and tasks where that
consensus is currently elusive[34], (
            <xref ref-type="bibr" rid="ref2">2</xref>
            ) obtaining
fresh and meaningful data sets on a regular basis,
(
            <xref ref-type="bibr" rid="ref3">3</xref>
            ) achieving broad participation (which means
having low barriers to entry, in terms of both
cost and reputational risk), and (
            <xref ref-type="bibr" rid="ref4">4</xref>
            ) producing its
results in a timely and eficient manner.
          </p>
          <p>
            Meeting the requirements and challenges on this list
will not be a trivial undertaking. Fortunately, those who
would create a benchmarking program aligned with this
vision are not without resources upon which to draw.
• Real-world. In order to be relevant to real-world As we have already seen, researchers have been
designpractice, it is essential that the evaluations con- ing and conducting evaluations of AI-enabled systems
ducted in the benchmarking program (
            <xref ref-type="bibr" rid="ref1">1</xref>
            ) closely for many years. While those evaluations have not been
model real-world conditions and objectives and designed for the same purposes as those that would be
(
            <xref ref-type="bibr" rid="ref2">2</xref>
            ) take as the target of their measurement the run in the proposed benchmarking program, they can
whole system of which the AI-enabled technol- still serve as a valuable resource for those seeking to
ogy is a part. A failure to do so would be a failure address the requirements and challenges of a
meaningto provide the evidence actually required by the ful AI benchmarking program. A few examples of such
public. resources are the following.4
• Meaningful. In order to be actionable, it is
essential that the results generated by the
benchmarking program (
            <xref ref-type="bibr" rid="ref1">1</xref>
            ) be expressed via meaningful
metrics and (
            <xref ref-type="bibr" rid="ref2">2</xref>
            ) be interoperable across national
and other jurisdictional boundaries. With regard
to metrics, “meaningful” means that they should
be (
            <xref ref-type="bibr" rid="ref1">1</xref>
            ) statistically sound, (
            <xref ref-type="bibr" rid="ref2">2</xref>
            ) relevant, and (
            <xref ref-type="bibr" rid="ref3">3</xref>
            ) and
understandable both to experts and non-experts.
          </p>
          <p>
            The interoperability requirement means that the
results of the exercises should be broadly usable,
providing information that can be acted upon
regardless of the specific goals and priorities that
are operative within any specific jurisdiction.
• Consistent and trusted. If the public is to rely
upon the results produced by a benchmarking
program, the results must be generated in a
consistent and trusted manner. This means, specifically,
(
            <xref ref-type="bibr" rid="ref1">1</xref>
            ) that the evaluations should be run on a
periodic schedule, (
            <xref ref-type="bibr" rid="ref2">2</xref>
            ) that the evaluations should be of
a reasonably consistent design (at least consistent
• The series of studies conducted in the
          </p>
          <p>NIST-sponsored Text Retrieval Conference
(TREC)[27].5
• The HELM (Holistic Evaluation of Language
Models) initiative undertaken by Stanford’s Center for</p>
          <p>Research on Foundation Models [33].
• METRICS – An international competition for the</p>
          <p>evaluation of robotics and AI[36].
• NIST’s 2021 AI Measurement and Evaluation</p>
          <p>Workshop[37].</p>
        </sec>
        <sec id="sec-info-benefits">
          <title>3.2.2. Benefits</title>
          <p>While designing and implementing a program that meets the requirements we have identified would be a challenge, the benefits of meeting that challenge are significant and tangible. By filling the evidence gap, the program would help to foster a public that was better informed about the real capabilities, limitations, and risks of AI-enabled systems (including those drawing upon LLMs). It would do so both directly, insofar as its results were consumed by members of the public without the mediation of other entities, and indirectly, insofar as its results reached the public through the mediation of civil society groups or educational initiatives focused on questions of society and technology. A better-informed public would, in turn, be one better positioned to recognize, and address, risks to core human values, to protect the liberty, privacy, and dignity of the individual, to resist the temptation of unwarranted hopes or fears about AI, and to support measures that further, in a responsible manner, scientific innovation and economic prosperity.</p>
          <p>Apart from these primary benefits, such a program would also bring a number of collateral benefits. These include: (1) thanks to its provision of empirically sound and readily understandable evaluations of effectiveness, providing policymakers and regulators with the basis for evidence-based decision making; (2) thanks to its meeting the interoperability requirement, fostering international cooperation; and (3) thanks to its addressing the challenges of defining and obtaining metrics for complex concepts and goals, advancing consensus around metrics and evaluation design.</p>
        </sec>
        <sec id="sec-1-4-2">
          <title>3.2.3. Action to date</title>
          <p>In recognition of both the benefits and the challenges of developing a benchmarking program that would meet the requirements we have identified, preliminary work has begun on the design and implementation of such a program. More specifically, under the auspices of the IEEE and The Future Society, a working group has been formed to explore the advisability and feasibility of pursuing such a project. The group includes representation from key agencies on both sides of the Atlantic. To date, the group has reached agreement on the need and the outlines of a program that would meet the need. Its current focus is on exploring practical questions related to how such a program should be developed. The group has not yet set a timetable for reporting on the results of its exploratory work.</p>
        </sec>
      </sec>
    </sec>
    <sec id="sec-2">
      <title>4. Strengthening the AI operating environment through better tools</title>
      <p>In the previous section, we considered a proposal aimed
at strengthening the human environment in which AI
is deployed through the fostering of a better-informed
public. More specifcally, the proposal seeks to create a
better-informed public through the establishment of an
institutionalized program of open and interoperable AI
benchmarking evaluations which have been designed to
gather and publish sound evidence regarding the
capabilities and limitations of AI-enabled systems when applied
in real-world circumstances.</p>
      <p>The evidence generated by benchmarking evaluations
is a key input to a sound assessment of the
trustworthiness of a technology. A well-designed benchmark (one
accurately modeling real-world conditions, using data
sets representative of those likely to be encountered in
the actual application of a technology, and quantifying
the various aspects of effectiveness through meaningful
metrics) can tell us what we can reasonably expect (in
terms of both capabilities and limitations) from a given
technology in a given circumstance. That expectation can
then be used to decide whether we have a plausible basis
for trusting the technology to perform the task we are
asking of it. The evidence generated by a benchmarking
evaluation cannot, however, tell us whether the
technology in question, once it has been applied, has in fact met
its objectives in the specific circumstance in which we
have applied it. If we want that information, we need to
turn to local validation.</p>
      <p>The results generated by a local validation exercise
(a real-time or after-the-fact test of the effectiveness
achieved by a given technology in a specific
circumstance) are complementary to those generated by
benchmarking evaluations. The latter tell us whether we have
empirical grounds for believing that a technology of a
given class will be successful in circumstances broadly
similar to those modeled in the benchmark; the former
tell us whether we have empirical grounds for
believing that a specific instance of a technological system
was successful in the specific circumstances in which we
did apply it (specific data, specific hardware conditions,
specific operators, specific timetables, and so on). Both
questions are relevant in assessing the trustworthiness of
a technology. The general question (answered by
benchmarking evaluations) is most relevant before application, those same competencies must be much more widely
when we are deciding whether to adopt the technology distributed. Individuals at a geographically very broad
for a given task. The specific question (answered by local range of sites of AI deployment will need to be supplied
validation) is most relevant after (or during) application, with the competencies required to run meaningful tests
when we are deciding whether to trust the results that of the technology as it has been deployed at their sites
have actually been generated by the technology. Having and in their specific circumstances. Meeting this need
reliable answers to both questions is essential to putting does not mean that every member of the public has to
the adoption and use of advanced technologies in the be equipped with the competencies required to design
service of the law on an empirically sound footing. and run evaluations; it will sufice to widen the circle of</p>
      <p>The complementary relationship between the two types of inquiry can be illustrated with an example taken from legal discovery in the US. The evaluations conducted in the TREC Legal Track (2006-2011) [35] produced results that showed that advanced retrieval technologies (often termed “technology-assisted review” or “TAR”) could be reasonably effective at performing the task of retrieving documents responsive to a request for production.⁶ That evidence gave responding parties the empirical basis they needed to adopt some variety of that class of technologies as the means to meet their discovery obligations (and, importantly, gave courts the empirical basis they needed to license that adoption). That evidence did not, however, obviate the need for local validation of the results generated by a given technology in a given matter. Requesting parties, and courts, still expect the circumstance-specific, after-the-fact, results that come only from local validation (and these expectations are often encoded in ESI (“electronically stored information”) protocols which govern discovery procedures in a given matter). The general (TREC) evaluations provided the plausibility that gave the green light for adoption, but the matter-specific (local) evaluations are still needed to provide the evidence that establishes the soundness of the actual results.</p>
      <p>⁶ The choice of modal is important here: the studies showed that TAR could achieve reasonably high levels of recall and precision; they did not show that TAR would, in all its instantiations and in all circumstances, achieve those results. Hence the need for local validation. This point is also sometimes insufficiently appreciated by readers of [39], which presented an analysis showing that TAR can be superior to manual review (not that it will be superior in all circumstances).</p>
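      <p>To make concrete what such a local validation exercise might involve, the following sketch estimates the recall achieved by a completed review from a simple random sample of judged documents, with a Wilson score interval to convey the uncertainty of the estimate. It is our own illustration under simplifying assumptions (simple random sampling, reliable relevance judgments); it is not a procedure prescribed by TREC or by any ESI protocol.</p>
      <preformat>
# Sketch of a simple local validation step for a completed document review:
# estimate recall from a random sample of judged documents, with a Wilson
# score interval. Illustrative only; actual protocols specify their own
# sampling and measurement procedures.
import math

def wilson_interval(successes, n, z=1.96):
    # Wilson score interval for a binomial proportion (95% by default).
    if n == 0:
        return 0.0, 0.0
    p = successes / n
    denom = 1 + z * z / n
    center = (p + z * z / (2 * n)) / denom
    half = (z / denom) * math.sqrt(p * (1 - p) / n + z * z / (4 * n * n))
    return max(0.0, center - half), min(1.0, center + half)

# Suppose a random sample of the collection was reviewed by subject-matter
# experts: 120 sampled documents were judged relevant, of which the
# technology-assisted review had retrieved 96.
relevant_in_sample = 120
retrieved_and_relevant = 96

recall_estimate = retrieved_and_relevant / relevant_in_sample
low, high = wilson_interval(retrieved_and_relevant, relevant_in_sample)
print(f"Recall estimate: {recall_estimate:.2f}")
print(f"95% Wilson interval: {low:.2f} to {high:.2f}")
      </preformat>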
      <sec id="sec-2-1">
        <title>If we wish to provide domain experts and operators with</title>
        <p>the resources needed to conduct local testing of the
systems they are overseeing, the resources we make
available to them must meet a number of requirements. Chief
among these are the following.
4.1. The need
If local validation is an important element in an
assessment of the trustworthiness of a technology, then there
is a need to bring about the conditions needed to ensure
that sound local validation exercises can be conducted
often and everywhere. Here, however, there is a
challenge. Whereas, in the case of benchmarking evaluations,
the competencies required to design and run
meaningful and statistically sound tests of the efectiveness of
a technology can reside in a relatively small number of
individuals (the individuals organizing and running the
benchmarking program), in the case of local validations,</p>
      </sec>
      <sec id="sec-2-2">
        <title>6The choice of modal is important here: the studies showed that TAR</title>
        <p>could achieve reasonably high levels of recall and precision; they did
not show that TAR would, in all its instantiations and in all
circumstances, achieve those results. Hence the need for local validation.
This point is also sometimes insuficiently appreciated by readers
of [39], which analyzed that showed that TAR can be superior to
manual review (not that it will be superior in all circumstances).
• Application-specific. The testing that is
required will vary from application to application.
What is required for the local validation of an
instance of TAR applied to the task of
discovery, for example, will difer from that which is
required for the local validation of a risk-assessment
technology applied to custody decisions. The
resources must therefore be application-specific
and the ultimate goal should be the creation of
a “library” of resources, each of which is tailored
(in terms of test design, metrics, sampling
procedures, interpretive guidance, and so on) to a
specific task to which an AI-enabled system may
be applied.</p>
      </sec>
      <sec id="sec-2-3">
        <title>7Of course, there will not be a need for local validation for every</title>
        <p>deployment of AI, but even restricting to deployments of sensitive
applications, and even allowing for some level of aggregate testing
of deployed technologies, there will still be a need for achieving
a much wider distribution of the required competencies than we
have today.</p>
      </sec>
      <sec id="sec-2-4">
        <title>The creation of a repository of resources like that pro</title>
        <p>posed in this section is no small undertaking; realizing
the vision will require input from experts from a wide
range of disciplines and subject-matter areas. The
beneifts of such a repository, however, would be considerable.</p>
        <p>These include:
• Tutorial and procedural content. The re- 4.2.2. Benefits
sources should provide not only a procedural
“recipe” for conducting a test, but should also
provide suficient tutorial content to enable an
operator to understand the motivation behind a
given procedural step (what a given term-of-art
means, why a given metric is being used, why a
given sampling design is chosen, and so on). To
be efective, these resources should be calibrated
for users with intermediate levels of expertise in
the use and testing of advanced legal
technologies. They need not be at the level of academic
research papers, but they do have to go beyond
elementary introductions.
• Intended audience. The resources should be
carefully calibrated to the level of expertise of
their intended audience. Those who will be
responsible for conducting local validation
exercise will be a smaller, and technically more
advanced, group than those consuming the results
of those evaluations. The resources be calibrated
to meet the requirements of these more expert
users (while, to the extent possible remaining
within the grasp, at least at a high level, of
nonexpert users).
• Adaptable. Even with the boundaries of a
specific domain and task, there will be considerable 4.2.3. Action to date
circumstance-specific variation from one
deployment of a system to another. The guidance pro- The repository we have proposed remains, at the moment,
vided by the resources should be of a suficient aspirational; there is as yet no program under way to
depth to enable an operator to adapt the specified create it. Work has begun, however, on creating materials
procedures for use in the specific circumstances that would meet the requirements specified for resources
at hand. in the repository and that could serve as a model for other
resources.</p>
        <p>More specifically, under the auspices of the IEEE and
The Future Society, a project has been initiated, and in
fact is nearing completion, to create a set of resources
that, in the specific domain of legal discovery, will enable
practitioners to conduct meaningful local validation of
the results of applying advanced review technologies (or,
for that matter, to the results of applying any review
technology) to the task of legal discovery. The specific
materials we have drafted are the following.</p>
        <p>• Improved competence;
• Improved efectiveness;
• Strengthened trust;
• Improved risk containment;
• More broadly distributed agency.
• Direction to other resources. As a practical
matter, the resources cannot cover every
circumstance likely to be encountered in the real-world.</p>
        <p>While they should be of suficient depth to cover
the most common circumstances, they should
provide direction to additional resources (including
human resources) to consult when less typical
circumstances are encountered.</p>
      </sec>
      <sec id="sec-2-5">
        <title>What we have listed above are general requirements</title>
        <p>that any resource must meet if it is to serve the purpose of
distributing the competencies needed to enable more
frequent and efective local validation of AI-enabled systems.
What we have not specified, however is any particular
format for the resources. That is by design. There are, in
fact, a range of diferent formats such resources might
take (written procedures, glossaries, handbooks, video
tutorials online calculators, and so on), and which format
will be most efective will vary from one domain (and
audience) to the next. We therefore leave the specific
format as a question to be decided at the implementation
stage.
• A Model Protocol. An adaptable model ESI
protocol that addresses the key issues that currently
trouble parties in the discovery phase of
litigation. The Protocol focuses on gathering the
evidence needed to have an informed trust in the
results of a review; its provisions are shaped by
the principles of proportionality and
evidencebased decision-making.
• A Commentary. A line-by-line commentary on
the Protocol. The Commentary is designed to
provide justification, interpretive guidance, and
tutorial background for the Protocol’s provisions.
• A Handbook for Practitioners. A companion
document that provides an expanded discussion
of the sampling and measurement procedures
specified in the Protocol. The Handbook is
intended to serve as a resource for advanced
practitioners (and other stakeholders) seeking a deeper
and more detailed understanding of the required
statistical procedures.
more nuanced and domain-specific approaches
to risk containment; and
• By distributing knowledge more broadly
(whether that distribution is direct or mediated
by other entities or initiatives) advance the
empowerment of the individual (against both
private and state actors).</p>
          <p>These materials have been drafted and are currently being reviewed by a group of experts with a range of different perspectives on the use of advanced technologies for legal discovery and on how to put that use on the basis of an informed trust. We plan to publish the materials in 2023. Our hope is that the materials will both serve their immediate purpose of putting the use and testing of e-discovery technologies on a sounder footing and serve the larger purpose of providing a model for resources that will enable the wider distribution of the competencies needed to conduct local validation of AI-enabled systems in other domains.</p>
        </sec>
      </sec>
    </sec>
    <sec id="sec-3">
      <title>5. Concluding remarks</title>
      <p>In this paper, we have drawn attention to the environment in which AI-enabled systems are deployed as a key element in any strategy for containing the risks (and for realizing the potential) attendant on the use of such systems. We have focused, more specifically, on the human component of the environment and considered two approaches (generating better information about AI’s real capabilities and limitations, creating tools that will enable practitioners to conduct local validation of the results of AI-enabled applications) for strengthening that component against risk. There are, of course, other aspects of the environment in which AI-enabled systems are deployed (hardware, software, even legal and financial), and exploration of ways to strengthen those components (making them more conducive to the detection, reporting, and resolution of risks) could pay off in more effective or efficient approaches to risk containment. One practical example is creating readily accessible pathways and repositories that would allow users (especially better-informed and better-equipped users) to report anomalies they have observed and to compare their observations with those submitted by others [4].</p>
      <p>As can be seen by reviewing the requirements specified for the two proposals we have considered, the work required to strengthen the environment against AI-associated risk is non-trivial. To be successful, approaches on the environment-focused track require a considerable amount of planning, coordination, and effort. The benefits these approaches bring, however, are significant. Environment-focused approaches may:</p>
      <p>• By distributing more broadly the means for identifying and responding to unwanted outcomes from AI-enabled applications, avoid some of the adverse effects on innovation and technological development that may be occasioned by top-down approaches;</p>
      <p>• By allowing practitioners to tailor solutions to their particular objectives and conditions, enable more nuanced and domain-specific approaches to risk containment; and</p>
      <p>• By distributing knowledge more broadly (whether that distribution is direct or mediated by other entities or initiatives), advance the empowerment of the individual (against both private and state actors).</p>
      <p>Given these benefits, we think that policymakers, and other stakeholders engaged in advancing the responsible use of AI, should always maintain an environment-focused (or bottom-up) track as a complement to the application-focused (or top-down) track. In fact, given the more benign collateral implications of environment-focused approaches, they should often be viewed as the solution of first recourse.</p>
    </sec>
    <sec id="sec-ack">
      <title>Acknowledgments</title>
      <p>We would like to thank the organizers of the 3rd International Workshop on Artificial Intelligence and Intelligent Assistance for Legal Professionals in the Digital Workplace (LegalAIIA 2023) for providing a forum at which we could express our views and hear those of others interested in these topics.</p>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>
          [1]
          <string-name>
            <given-names>M.</given-names>
            <surname>Shanahan</surname>
          </string-name>
          ,
          <article-title>Talking about large language models</article-title>
          ,
          <source>arXiv preprint arXiv:2212.03551</source>
          (
          <year>2022</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>
          [2]
          <string-name>
            <given-names>E. M.</given-names>
            <surname>Bender</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Koller</surname>
          </string-name>
          ,
          <article-title>Climbing towards NLU: On meaning, form, and understanding in the age of data</article-title>
          , in:
          <source>Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics</source>
          , Online,
          <year>2020</year>
          , pp.
          <fpage>5185</fpage>
          -
          <lpage>5198</lpage>
          . URL: https://aclanthology.org/2020.acl-main.463. doi:10.18653/v1/2020.acl-main.463
        </mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>
          [3]
          <string-name>
            <given-names>OECD</given-names>
            ,
            <surname>Principles on</surname>
          </string-name>
          <string-name>
            <surname>AI</surname>
          </string-name>
          ,
          <year>2019</year>
          . URL: https://oecd.ai/ en/ai-principles.
        </mixed-citation>
      </ref>
      <ref id="ref4">
        <mixed-citation>
          [4] IEEE,
          <source>Ethically Aligned Design, Version</source>
          <volume>1</volume>
          ,
          <year>2019</year>
          . URL: https://standards.ieee.org/ industry-connections/ec/ead1e-infographic/.
        </mixed-citation>
      </ref>
      <ref id="ref5">
        <mixed-citation>
          [5] Council of Europe,
          <source>European Ethical Charter on the Use of Artificial Intelligence in Judicial Systems and their Environment</source>
          ,
          <year>2018</year>
          . URL: https://rm.coe.
          <article-title>int/ ethical-charter-en-for-</article-title>
          <string-name>
            <surname>publication-</surname>
          </string-name>
          4
          <string-name>
            <surname>-</surname>
          </string-name>
          december-2018
          <source>/ 16808f699c.</source>
        </mixed-citation>
      </ref>
      <ref id="ref6">
        <mixed-citation>
          [6]
          <string-name>
            <given-names>European</given-names>
            <surname>Commission</surname>
          </string-name>
          ,
          <source>Ethics Guidelines for Trustworthy AI</source>
          ,
          <year>2019</year>
          . URL: https://digital-strategy.ec.europa.eu/en/library/ ethics
          <article-title>-guidelines-trustworthy-ai.</article-title>
        </mixed-citation>
      </ref>
      <ref id="ref7">
        <mixed-citation>
          [7] Future of Life,
          <source>Asilomar AI Principles</source>
          ,
          <year>2017</year>
          . URL: https://futureoflife.org/
          <year>2017</year>
          /08/11/ai-principles/.
        </mixed-citation>
      </ref>
      <ref id="ref8">
        <mixed-citation>[8] Partnership on AI, PAI Tenets, 2016. URL: https://partnershiponai.org/about/#tenets.</mixed-citation>
      </ref>
      <ref id="ref9">
        <mixed-citation>[9] J. Fjeld, N. Achten, H. Hilligoss, A. Nagy, M. Srikumar, Principled artificial intelligence: Mapping consensus in ethical and rights-based approaches to principles for AI, Berkman Klein Center Research Publication (2020).</mixed-citation>
      </ref>
      <ref id="ref10">
        <mixed-citation>[10] T. Hagendorff, The ethics of AI ethics: An evaluation of guidelines, Minds and Machines 30 (2020) 99–120.</mixed-citation>
      </ref>
      <ref id="ref11">
        <mixed-citation>[11] Y. Zeng, E. Lu, C. Huangfu, Linking artificial intelligence principles, 2018. arXiv:1812.04814.</mixed-citation>
      </ref>
      <ref id="ref12">
        <mixed-citation>[12] L. Weidinger, J. Uesato, M. Rauh, C. Griffin, P.-S. Huang, J. Mellor, A. Glaese, M. Cheng, B. Balle, A. Kasirzadeh, et al., Taxonomy of risks posed by language models, in: 2022 ACM Conference on Fairness, Accountability, and Transparency, 2022, pp. 214–229.</mixed-citation>
      </ref>
      <ref id="ref13">
        <mixed-citation>[13] Y. Bai, S. Kadavath, S. Kundu, A. Askell, J. Kernion, A. Jones, A. Chen, A. Goldie, A. Mirhoseini, C. McKinnon, et al., Constitutional AI: Harmlessness from AI feedback, arXiv preprint arXiv:2212.08073 (2022).</mixed-citation>
      </ref>
      <ref id="ref14">
        <mixed-citation>[14] J. Mökander, J. Schuett, H. R. Kirk, L. Floridi, Auditing large language models: a three-layered approach, AI and Ethics (2023) 1–31.</mixed-citation>
      </ref>
      <ref id="ref15">
        <mixed-citation>[15] D. Long, B. Magerko, What is AI literacy? Competencies and design considerations, in: Proceedings of the 2020 CHI Conference on Human Factors in Computing Systems, 2020, pp. 1–16.</mixed-citation>
      </ref>
      <ref id="ref16">
        <mixed-citation>[16] P. Lin, J. Van Brummelen, Engaging teachers to co-design integrated AI curriculum for K-12 classrooms, in: Proceedings of the 2021 CHI Conference on Human Factors in Computing Systems, 2021, pp. 1–12.</mixed-citation>
      </ref>
      <ref id="ref17">
        <mixed-citation>[17] D. Gašević, G. Siemens, S. Sadiq, Empowering learners for the age of artificial intelligence, Computers and Education: Artificial Intelligence (2023) 100130.</mixed-citation>
      </ref>
      <ref id="ref18">
        <mixed-citation>[18] V. Lepercq, Introducing Education, 2022. URL: https://huggingface.co/blog/education.</mixed-citation>
      </ref>
      <ref id="ref19">
        <mixed-citation>[19] Hugging Face, Introducing The World’s Largest Open Multilingual Language Model: BLOOM, 2023. URL: https://bigscience.huggingface.co/blog/bloom.</mixed-citation>
      </ref>
      <ref id="ref20">
        <mixed-citation>[20] The Athens Roundtable, The Athens Roundtable: Artificial Intelligence and the Rule of Law, 2019. URL: https://www.aiathens.org/dialogue/first-edition.</mixed-citation>
      </ref>
      <ref id="ref21">
        <mixed-citation>[21] The Future Society, MOOC on AI and the Rule of Law, 2022. URL: https://thefuturesociety.org/2022/05/12/mooc-on-ai-and-the-rule-of-law-successful-completion-of-the-pilot-phase/.</mixed-citation>
      </ref>
      <ref id="ref22">
        <mixed-citation>[22] aiEDU, aiEDU: The AI Education Project, 2023. URL: https://www.aiedu.org.</mixed-citation>
      </ref>
      <ref id="ref23">
        <mixed-citation>[23] C. McLeod, Trust, in: E. N. Zalta (Ed.), Stanford Encyclopedia of Philosophy, Metaphysics Research Lab, Stanford University (2006).</mixed-citation>
      </ref>
      <ref id="ref24">
        <mixed-citation>[24] IJCAI, Artificial Intelligence, 2023. URL: https://www.sciencedirect.com/journal/artificial-intelligence.</mixed-citation>
      </ref>
      <ref id="ref25">
        <mixed-citation>[25] AAAI, Association for the Advancement of Artificial Intelligence, 2023. URL: https://www.aaai.org/.</mixed-citation>
      </ref>
      <ref id="ref26">
        <mixed-citation>[26] IAAIL, International Conference on Artificial Intelligence and Law (ICAIL), 2023. URL: http://www.iaail.org/.</mixed-citation>
      </ref>
      <ref id="ref27">
        <mixed-citation>[27] NIST, Text REtrieval Conference (TREC), 2023. URL: https://trec.nist.gov/.</mixed-citation>
      </ref>
      <ref id="ref28">
        <mixed-citation>[28] Ministère de la Justice, Communiqué du Ministère de la Justice et de la première présidence de la cour d’appel de Rennes, 2017.</mixed-citation>
      </ref>
      <ref id="ref29">
        <mixed-citation>[29] World Economic Forum, Responsible Limits on Facial Recognition; Use Case: Flow Management; Part II: Pilot phase: Self-assessment, the audit management system and certification, 2020. URL: https://www3.weforum.org/docs/WEF_Responsible_Limits_on_Facial_Recognition_2020.pdf.</mixed-citation>
      </ref>
      <ref id="ref30">
        <mixed-citation>[30] J. Snow, Amazon’s Face Recognition Falsely Matched 28 Members of Congress with Mugshots, 2018. URL: https://www.aclu.org/blog/privacy-technology/surveillance-technologies/amazons-face-recognition-falsely-matched-28.</mixed-citation>
      </ref>
      <ref id="ref31">
        <mixed-citation>[31] C. Garvie, A. Bedoya, J. Frankle, The perpetual line-up, Georgetown Law Center on Privacy &amp; Technology 18 (2016).</mixed-citation>
      </ref>
      <ref id="ref32">
        <mixed-citation>[32] NIST, Face Recognition Vendor Test (FRVT), 2023. URL: https://www.nist.gov/programs-projects/face-recognition-vendor-test-frvt-ongoing.</mixed-citation>
      </ref>
      <ref id="ref33">
        <mixed-citation>[33] Center for Research on Foundation Models, Holistic Evaluation of Language Models (HELM), 2023. URL: https://crfm.stanford.edu/helm/latest/.</mixed-citation>
      </ref>
      <ref id="ref34">
        <mixed-citation>[34] NIST, AI Measurement and Evaluation Panel on Measuring Concepts that are Complex, Contextual, and Abstract, 2021. URL: https://www.nist.gov/news-events/events/2021/06/ai-measurement-and-evaluation-workshop.</mixed-citation>
      </ref>
      <ref id="ref35">
        <mixed-citation>[35] TREC, TREC Legal Track, 2011. URL: https://trec-legal.umiacs.umd.edu/.</mixed-citation>
      </ref>
      <ref id="ref36">
        <mixed-citation>[36] LNE, METRICS: An international competition for the evaluation of robotics and AI, 2023. URL: https://metricsproject.eu.</mixed-citation>
      </ref>
      <ref id="ref37">
        <mixed-citation>[37] NIST, AI Measurement and Evaluation, 2021. URL: https://www.nist.gov/news-events/events/2021/06/ai-measurement-and-evaluation-workshop.</mixed-citation>
      </ref>
      <ref id="ref38">
        <mixed-citation>[38] S. Hallensleben, C. Hustedt, From principles to practice: An interdisciplinary framework to operationalise AI ethics, Bertelsmann Stiftung, 2020.</mixed-citation>
      </ref>
      <ref id="ref39">
        <mixed-citation>[39] M. R. Grossman, G. V. Cormack, Technology-assisted review in e-discovery can be more effective and more efficient than exhaustive manual review, Rich. JL &amp; Tech. 17 (2010) 1.</mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>