<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta />
    <article-meta>
      <title-group>
        <article-title>Strengthening the AI Operating Environment</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Bruce Hedin</string-name>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Samuel Curtis</string-name>
        </contrib>
        <aff>Hedin B Consulting</aff>
        <aff>The Future Society</aff>
      </contrib-group>
      <abstract>
        <p>In the rapidly evolving discourse on artificial intelligence (AI), the familiar refrain of “maximizing potential while mitigating risks” has become something of a ubiquitous mantra, emphasizing the need for an effective risk mitigation framework. This paper briefly examines the current state of AI-enabled applications and discusses the various risk containment strategies being implemented. Initial efforts focused on establishing high-level principles for responsible AI use. More recent strategies have sought to operationalize these principles through normative instruments, such as industry best practices and legal statutes, that govern AI applications and their creators. While valuable, such a top-down approach is not sufficiently effective; a complementary, bottom-up approach focused on strengthening the environment in which AI is deployed is also necessary. The paper analyzes two specific initiatives aimed at enhancing the human component of AI deployment (creating a better-informed public through AI benchmarks, creating a better-equipped public with resources for local validation) and offers insights on how this environment-focused track can contribute to risk containment. Furthermore, we suggest additional steps for leveraging this approach in tandem with top-down strategies to cultivate a more robust risk mitigation framework.</p>
      </abstract>
      <kwd-group>
        <kwd>AI governance</kwd>
        <kwd>AI education</kwd>
        <kwd>AI risk</kwd>
        <kwd>Benchmarking</kwd>
        <kwd>Evaluation</kwd>
        <kwd>Validation</kwd>
        <kwd>Effectiveness</kwd>
        <kwd>Competence</kwd>
        <kwd>Trust</kwd>
      </kwd-group>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>1. Introduction</title>
      <p>As the use of AI-enabled applications, both in the legal domain and elsewhere, has gone from a topic for academic discussion to a matter of everyday practice, questions about how best to realize the potential of such applications, and how best to mitigate the risks attendant upon their use, have taken front and center in the various venues in which the interaction between technology and the norms and institutions that govern the life of society is discussed. This attention to AI’s potential for both good and bad, and to ways of realizing the former while containing the latter, has only been heightened in recent months by the release of a range of publicly accessible applications that draw on large language models (such as GPT-4).</p>
      <p>An attention to the risks attendant on the use of AI, provided it is grounded in an understanding of AI’s real capabilities and limitations, is salutary. It is true that the risks, given the current state of the technology, are sometimes overstated (LLMs are indeed robust platforms for a range of different applications and can generate output that closely approximates that which a human might create; they are, nevertheless, still simply statistical models of discourse tokens, well short of the capacity for understanding and creativity characteristic of general intelligence [1][2]). It is also true, however, that even narrow-purpose AI applications (e.g., within the legal domain, those designed specifically for judicial decision modeling, predictive policing, or facial recognition) can, if used improperly, jeopardize core social values such as fairness, subtract from individual privacy, liberty, and dignity, and undermine assumptions about truth-seeking and justice-realization that are the basis for the rule of law (and hence for a stable democratic order). These are, regardless of one’s perspective on the capacity and implications of LLMs, serious risks that call for commensurate efforts at risk containment.</p>
      <p>Efforts at containing the risks attendant upon AI have been under way for some time. Early efforts focused on articulating high-level, value-oriented principles for the responsible design, development, and use of AI (for examples, see [3][4][5][6][7][8]). Collectively, these efforts were, if the sheer quantity of principles (or sets of principles) proposed is a measure of success, quite successful [9][10][11]. Where these efforts fell short was in establishing mechanisms connecting the principles to actual practice.</p>
      <p>More recent efforts, seeking to fill this gap, have focused on the question of how to operationalize such principles. The objective of these efforts has generally been the creation of normative instruments that would encourage, or enforce, adherence to the aspirational principles.</p>
      <p>The forms proposed for such normative instruments have varied, from informal industry best practices, to more precise (and auditable) standards, all the way to enforceable legal statutes. The object of governance for these normative instruments has been primarily AI applications and their creators: the norms established are intended to act as “guardrails” on the design and operation of AI-enabled applications, on the objectives and requirements of developers of applications, on the use cases in which applications may be deployed, and even on the structure and conduct of the entities that produce AI-enabled applications. Notable examples of initiatives on this top-down track include the creation of government offices charged with responsibility for algorithm inspection, proposals for regulations requiring that AI applications meet certain design specifications (“privacy by design,” “human rights by design”), laws requiring the destruction of training data, restrictions or outright bans on the use of certain applications (judicial modeling technologies, facial recognition technologies), and calls for a global moratorium on the research and development of “strong” AI.</p>
      <p>This top-down, application-focused, approach to risk
containment is, in at least some of its less heavy-handed
instantiations, a valuable and necessary one. It does not,
however, exhaust the approaches to risk containment
available to policymakers and other stakeholders in the
safe use of AI. Complementary to the application-focused
approach is an approach that starts from a bottom-up
perspective and takes as its objective, not the creation of
guardrails on the development and use of AI, but rather
the strengthening (or “hardening”) of the environment
in which AI-enabled applications are deployed. This
approach seeks to contain risk by making the
environment (in all its components: hardware, software, and
human) in which AI is deployed more resistant to AI
misuse (whether intentional or not) and therefore less
susceptible to the risks attendant on such misuse.</p>
      <p>In this paper, we examine more closely the potential that the bottom-up, environment-focused, track holds as a means for risk containment. We do so by considering approaches to strengthening the human component of the environment in which AI-enabled applications are deployed. More specifically, we draw attention to two key gaps in the resources currently available to stakeholders in the responsible use of AI in the service of the law: (1) the absence (discussed in Section 3) of an on-going program of benchmarks that can provide stakeholders with meaningful information on the actual capabilities and limitations of AI-enabled legal applications and (2) the absence (discussed in Section 4) of resources that would allow practitioners to conduct their own evaluations of the effectiveness of AI in real-world settings.</p>
      <p>In the case of each gap, we characterize the nature of the need, identify the features of a solution that would meet the need, and discuss work done to date toward such a solution. With the perspective gained from this discussion, we draw (in Section 5) some general lessons about the potential the environment-focused track holds for risk containment.</p>
    </sec>
    <sec id="sec-rw">
      <title>2. Related work</title>
      <p>This paper offers a framework (simply put: top-down vs. bottom-up) for assessing approaches to mitigating the risks attendant on the use of AI-enabled applications in the service of the law. There are, of course, other frameworks that have been offered, and these also can provide insightful perspectives. Among the initiatives that are related to, and often complementary to, the work presented in this paper are the following.</p>
      <p>Guidelines. The Asilomar AI Principles [7] put forward 23 principles, spanning research issues, ethics and values, and longer-term issues, for the research and development of AI. The European Ethical Charter on the Use of AI in Judicial Systems and their Environment [5], adopted by the European Commission for the Efficiency of Justice (CEPEJ), presents five principles intended for both public and private stakeholders responsible for the design and deployment of AI tools and services that involve the processing of judicial decisions and data. The General Principles of Ethically Aligned Design [4] proposes eight principles by which the ethical and values-based design, development, and implementation of autonomous and intelligent systems (including artificial intelligence and intelligent assistance technologies designed for legal professionals; see the Chapter on Law) should be guided. The European Commission’s Ethics Guidelines for Trustworthy AI [6], drafted by the European Commission High-Level Expert Group on AI, puts forward seven key requirements that AI systems should meet in order to be deemed trustworthy. The Partnership on AI has drafted eight tenets [8] that its members, spanning industry, academia, and non-profits, “endeavor to uphold.”</p>
      <p>Risk mitigation frameworks focused specifically on LLMs. Weidinger et al. [12] proposes a comprehensive taxonomy of ethical and social risks associated with large-scale language models, identifying twenty-one risks across six risk areas and discussing approaches to risk mitigation. Bai et al. [13] proposes a method called Constitutional AI (CAI) to train a non-evasive and relatively harmless AI assistant without human feedback labels for harms, with the aim of developing techniques to create AI systems that adhere to design (or “constitutional”) principles, as opposed to learning from human feedback. Mökander et al. [14] proposes a three-layered approach to auditing large language models, which includes governance audits, model audits, and application audits, the third of which includes a component to check LLMs’ adherence to ethical principles.</p>
      <p>Educational efforts. Long and Magerko [15] provides a concrete definition of AI literacy based on existing research and synthesizes a variety of interdisciplinary literature into a set of core competencies of AI literacy, as well as design considerations to support AI developers and educators in creating learner-centered AI. Lin and Van Brummelen [16] presents the findings from workshops co-designed with K-12 teachers (that scaffolding in AI tools and curricula is needed for ethical and data discussions, learner evaluation, engagement, peer collaboration, and critical reflection) and an exemplar lesson plan illustrating ways to teach AI in non-computing subjects within a remote setting. Gašević et al. [17] explores the theme of empowering learners for the age of AI and highlights the need for foundational discussions about learning theory and conceptualizations of learning actions and behaviors in AI-human settings, as well as concerns regarding ethics, bias, and fairness in AI’s growing influence. Hugging Face [18] has sought to democratize machine learning knowledge and competence by offering educational materials for beginners as well as instructors. Hugging Face supported the BigScience open research collaboration, which brought together more than 1,000 researchers from 60 countries and more than 250 institutions to create BLOOM [19], an openly and transparently trained multilingual LLM.</p>
    </sec>
    <sec id="sec-info">
      <title>3. Strengthening the AI operating environment through better information</title>
      <p>An environment in which those who would use, or be affected by, AI-enabled applications lack at least baseline information about AI (what it is, where it is, how it works, and how well it works) is an environment conducive to misuse (not to mention to unhelpful, even harmful, hype). Conversely, an environment in which both active and passive users of AI are well-informed about AI’s use cases, its conditions of use, and its strengths and weaknesses is one that will be more resistant to AI misuse (and to the risks associated with such misuse). An important component of any effective strategy for containing the risks associated with AI will therefore be education: if we can foster a public that is better informed about AI, we will foster a public better equipped to recognize and guard against the risks associated with it.</p>
      <p>The role of education in risk containment has been recognized for some time. Education was one of the three themes of the inaugural edition (2019) of The Athens Roundtable on Artificial Intelligence and the Rule of Law [20]. Education is also the focus of a number of current initiatives. The Future Society, to cite one example, has developed a MOOC on AI and the Rule of Law [21]; aiEDU, to cite another, is an initiative that promotes broad AI literacy through the development of AI curricula for use in a wide range of educational venues, from K-12 schools to public museums [22].</p>
      <p>As potentially valuable as these educational initiatives are, they will be successful in meeting their objectives only insofar as they are able to access and convey accurate and meaningful content. This is where a challenge appears: for some topics, namely topics related to the effectiveness of AI, the content is lacking (or at least lacking in the form required for fostering broadly distributed AI competence). In this section, we examine this gap and consider an approach to filling it.</p>
      <sec id="sec-info-need">
        <title>3.1. The need</title>
        <p>If we wish to foster an informed, and empowered,¹ public, one capable of making empirically well-grounded decisions about the sorts of tasks to which AI should and should not be applied, and the conditions that should be met when it is applied, we need to ensure that the public has access to accurate information about the effectiveness of AI (i.e., its capabilities and limitations when applied to real-world tasks). The problem is that evidence of the effectiveness of AI-enabled applications is spotty: it exists, and is accessible, in only a very incomplete and inconsistent manner. The reason is that there is no suitably authoritative institutionalized program for generating the required evidence in a manner and format that can be readily consumed by individuals and civil society groups, thereby meeting the objective of giving citizens informed agency over the use of AI in their (and their fellow citizens’) lives.</p>
        <p>¹ Informed, of course, does not necessarily mean empowered. Advancing the empowerment of citizens means not only ensuring that citizens have access to information but also ensuring that the legal and practical conditions are such that citizens can act on that information.</p>
        <p>It is also worth noting that, while our focus in this section is on the means to foster an informed public, the evidence gap just observed has wider implications. It acts as a roadblock not only to meeting the objective of an informed public but also to meeting the other objectives of the principles that have been articulated for the responsible use of AI (which may be stated,² at an abstract level, as (1) protection of core values, (2) creation of the conditions needed for an informed trust, and (3) advancement of technological innovation and economic prosperity).</p>
        <p>² This threefold classification scheme for the objectives of principles is the authors’ own; other classification schemes can be found in [9] and [10].</p>
        <p>Without sound evidence of effectiveness, we will be unable to protect core values, because we won’t know (a) whether the AI-enabled systems achieve their immediate goals nor (b) whether, even in achieving those immediate goals, they impinge on other core values. We will be unable to create the conditions of trust, because we will lack the empirical data that is the basis for a well-grounded trust (or distrust) [23]. We will be unable to advance the goals of technological innovation and economic prosperity, because we will lack the information needed to optimize the allocation of research effort and financing. In terms of approach, moreover, having access to sound evidence of effectiveness is necessary for both bottom-up and top-down approaches to risk containment. With regard to the latter, as illustrated in Figure 1, evidence of effectiveness is necessary both for the formulation of viable normative instruments and for the assessment of adherence to the norms instantiated in such instruments. In short, evidence of effectiveness is needed both for the general objective of ensuring the responsible use of AI-enabled systems and for the specific objective of fostering an informed public.</p>
      <sec id="sec-1-1">
        <title>An environment in which those who would use, or be</title>
        <p>afected by, AI-enabled applications lack at least baseline
information about AI (what it is, where it is, how it works,
and how well it works) is an environment conducive to
misuse (not to mention to unhelpful, even harmful, hype).</p>
        <p>Conversely, an environment in which both active and
passive users of AI are well-informed about AI’s use cases,
its conditions of use, and its strengths and weaknesses
is one that will be more resistant to AI misuse (and to
the risks associated with such misuse). An important
component of any efective strategy for containing the
risks associated with AI will therefore be education: if
we can foster a public that is better informed about AI,
we will foster a public better equipped to recognize and
guard against the risks associated with it.</p>
        <p>The role of education in risk containment has been rec- 1Informed, of course, does not necessarily mean empowered.
Adognized for some time. Education was one of the three vancing the empowerment of citizens means not only ensuring
themes of the inaugural edition (2019) of The Athens ltehgaatlcaitnidzepnrsahctaivcaelaccocnedssititoonisnfaorremsuatcihonthbauttcaitliszoenesnscuanrinagcttohnattthhaet
Roundtable on Artificial Intelligence and the Rule of information.</p>
        <p>Law[20]. Education is also the focus of a number of 2This threefold classification scheme for the objectives of principles
current initiatives. The Future Society, to cite one ex- is the authors’ own; other classification schemes can be found in [ 9]
and [10]
• Narrow focus. The research objectives of the
evaluations are often such that they are better
served by narrowly circumscribing the scope of
the exercise, not measuring the impact of the
whole sociotechnical system, of which the
technology is a part, on the values with which the
public may be concerned. Consistent with these
objectives, the evaluations gauge the performance
of the technologies being evaluated using metrics
specifically relevant to the capability addressed in
the study; they do not seek measures that would
provide a comprehensive view of the
technology’s fitness for purpose.
• Distance from real-world circumstances. In
the interest of arriving at a well-controlled
answer to specific research questions, the studies
often do not make allowance for variability in
all the factors that could, in a real-world setting,
afect a system’s efectiveness. The result is an
exercise that is removed from the real-world
circumstances. Moreover, obtaining evaluation data
sets that, in both size and character, are reflective
of the data populations to which the technology
under evaluation would be applied in a real-world
setting is a challenge that current evaluations are
often unable to meet.
goals, they impinge on other core values. We will be
unable to create the conditions of trust, because we
will lack the empirical data that is the basis for a
wellgrounded trust (or distrust) [23]. We will be unable to
advance the goals of technological innovation and
economic prosperity, because we will lack the
information needed to optimize the allocation of research efort
and financing. In terms of approach, moreover, having
access to sound evidence of efectiveness is necessary
for both bottom-up and top-down approaches to risk • Misalignment of purpose. Many currently
containment. With regard to the latter, as illustrated in available evaluations are of the one-of variety:
Figure 1, evidence of efectiveness is necessary both for they are designed to produce just the data needed
the formulation of viable normative instruments and for for the study that occasioned them and they are
the assessment of adherence to the norms instantiated not intended to be repeated on a regular basis. An
in such instruments. In short, evidence of efectiveness additional limitation that is particularly
characis needed both for the general objective of ensuring the teristic of industry white papers is that they are
responsible use of AI-enabled systems and for the specific generally designed, not to provide a well-rounded
objective of fostering an informed public. view of the technology’s fitness for purpose, but</p>
        <p>Now, to say that the required evidence is lack- to highlight characteristics of the enterprise’s
ofing is not to say that there is no evidence at all. fering that, the enterprise believes, will resonate
There is indeed a healthy flow of reports, of var- in the marketplace.
ious types, of evaluations of the efectiveness of
AI-enabled systems. These include: academic re- 3.2. A proposal for meeting the need
search papers[24][25][26]; industry white papers;
reports of government-sponsored evaluations[27][28]; If the objective of an informed (and empowered) public is
evaluations conducted by non-governmental civic a worthy one, and if a lack of evidence of the efectiveness
organizations[29][30][31]; and academic and industry- of AI-enabled systems is impeding the achievement of
sponsored benchmarking initiatives[32][33]. that goal, then what might a solution that removed that</p>
        <p>The problem is that these evaluations, while well- impediment look like? What we propose, and what we
designed to meet their own objectives, have not been discuss in the remainder of this section, is the creation
designed specifically to meet the objective of fostering of an on-going institutionalized program of
interoperaa general public that is informed and empowered. As ble open AI benchmarks, the purpose of which would
a result, the evidence the studies generate is lacking in be to supply the empirical evidence needed to foster a
key features required to meet that objective. Among key public empowered to make informed decisions about the
limitations of current evaluations3 are the following. use of AI-enabled technologies. The benchmarks should
be “open” in the sense that exercises must be
transpar</p>
      </sec>
      <sec id="sec-1-2">
        <title>3We are, of course, not of saying that all currently available evalua</title>
        <p>tions are subject to all of these limitations. We are saying simply
that each evaluation is subject to at least one of them.
ent: data used, procedures followed, and results
generated must all be open to inspection (or, in some cases,
audit), by both participants and independent observers.
They should be “interoperable” in the sense that they
will supply evidence usable by all regulatory regimes,
regardless of the specific goals and priorities that are
operative within any specific jurisdiction. Furthermore,
if they are to serve their intended purpose of fostering a
better-informed public, they should generate results that
can be understood by both experts and non-experts.</p>
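        <p>As an illustration of what “open” and “interoperable” might mean at the level of published results, the following sketch shows a minimal, machine-readable result record that could accompany each run of an exercise. It is our own illustration, and every field name in it is a hypothetical assumption rather than an element of any existing benchmark’s schema.</p>
        <preformat>
# Minimal sketch of an interoperable benchmark result record.
# Illustrative only: the field names are our own assumptions,
# not an established schema.
import json
from dataclasses import dataclass, asdict

@dataclass
class BenchmarkResult:
    exercise: str       # which evaluation exercise produced the result
    run_date: str       # ISO 8601 date of the run
    system: str         # named (or anonymized) system under test
    task: str           # real-world task the exercise models
    metric: str         # metric name, defined in the exercise protocol
    value: float        # point estimate
    ci_low: float       # lower bound of the 95% confidence interval
    ci_high: float      # upper bound of the 95% confidence interval
    data_uri: str       # pointer to the inspectable evaluation data
    procedure_uri: str  # pointer to the published procedure followed

record = BenchmarkResult(
    exercise="legal-retrieval-2023", run_date="2023-06-19",
    system="system-A", task="responsive-document-retrieval",
    metric="recall", value=0.82, ci_low=0.78, ci_high=0.86,
    data_uri="https://example.org/data",
    procedure_uri="https://example.org/protocol",
)
print(json.dumps(asdict(record), indent=2))
        </preformat>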
        <sec id="sec-1-2-1">
          <title>3.2.1. Requirements</title>
          <p>
            A benchmarking program that will meet the general
objective of fostering an informed public (a public that
includes everyone from researchers and designers, to
policymakers and lawyers, all the way to the potentially
involuntary decision subjects of judicial or enforcement
technologies) will have to meet certain requirements. It
must: (
            <xref ref-type="bibr" rid="ref1">1</xref>
            ) design evaluations that model real-world
circumstances; (
            <xref ref-type="bibr" rid="ref2">2</xref>
            ) generate results that will be meaningful,
and actionable, for a wide range of stakeholders; (
            <xref ref-type="bibr" rid="ref3">3</xref>
            ) run
evaluation exercises that are consistent and trusted; and
(
            <xref ref-type="bibr" rid="ref4">4</xref>
            ) be practically viable. Specific implications of these
basic requirements are the following.
enough to allow informative comparison from
one run of an exercise to the next), (
            <xref ref-type="bibr" rid="ref3">3</xref>
            ) that the
program should be institutionalized (i.e., have the
legitimacy and durability that come from
sponsorship by recognized public authorities), and (
            <xref ref-type="bibr" rid="ref4">4</xref>
            )
that the design and execution of the evaluations
run in the program be transparent (data used,
procedures followed, and results generated must all
be open to inspection by both participants and
independent observers).
• Practical. In order to be viable, the program
must also meet a number of non-trivial practical
requirements. These include (
            <xref ref-type="bibr" rid="ref1">1</xref>
            ) reaching
consensus on metrics for concepts and tasks where that
consensus is currently elusive[34], (
            <xref ref-type="bibr" rid="ref2">2</xref>
            ) obtaining
fresh and meaningful data sets on a regular basis,
(
            <xref ref-type="bibr" rid="ref3">3</xref>
            ) achieving broad participation (which means
having low barriers to entry, in terms of both
cost and reputational risk), and (
            <xref ref-type="bibr" rid="ref4">4</xref>
            ) producing its
results in a timely and eficient manner.
          </p>
          <p>
            Meeting the requirements and challenges on this list
will not be a trivial undertaking. Fortunately, those who
would create a benchmarking program aligned with this
vision are not without resources upon which to draw.
• Real-world. In order to be relevant to real-world As we have already seen, researchers have been
designpractice, it is essential that the evaluations con- ing and conducting evaluations of AI-enabled systems
ducted in the benchmarking program (
            <xref ref-type="bibr" rid="ref1">1</xref>
            ) closely for many years. While those evaluations have not been
model real-world conditions and objectives and designed for the same purposes as those that would be
(
            <xref ref-type="bibr" rid="ref2">2</xref>
            ) take as the target of their measurement the run in the proposed benchmarking program, they can
whole system of which the AI-enabled technol- still serve as a valuable resource for those seeking to
ogy is a part. A failure to do so would be a failure address the requirements and challenges of a
meaningto provide the evidence actually required by the ful AI benchmarking program. A few examples of such
public. resources are the following.4
• Meaningful. In order to be actionable, it is
essential that the results generated by the
benchmarking program (
            <xref ref-type="bibr" rid="ref1">1</xref>
            ) be expressed via meaningful
metrics and (
            <xref ref-type="bibr" rid="ref2">2</xref>
            ) be interoperable across national
and other jurisdictional boundaries. With regard
to metrics, “meaningful” means that they should
be (
            <xref ref-type="bibr" rid="ref1">1</xref>
            ) statistically sound, (
            <xref ref-type="bibr" rid="ref2">2</xref>
            ) relevant, and (
            <xref ref-type="bibr" rid="ref3">3</xref>
            ) and
understandable both to experts and non-experts.
          </p>
          <p>
            The interoperability requirement means that the
results of the exercises should be broadly usable,
providing information that can be acted upon
regardless of the specific goals and priorities that
are operative within any specific jurisdiction.
• Consistent and trusted. If the public is to rely
upon the results produced by a benchmarking
program, the results must be generated in a
consistent and trusted manner. This means, specifically,
(
            <xref ref-type="bibr" rid="ref1">1</xref>
            ) that the evaluations should be run on a
periodic schedule, (
            <xref ref-type="bibr" rid="ref2">2</xref>
            ) that the evaluations should be of
a reasonably consistent design (at least consistent
• The series of studies conducted in the
          </p>
          <p>NIST-sponsored Text Retrieval Conference
(TREC)[27].5
• The HELM (Holistic Evaluation of Language
Models) initiative undertaken by Stanford’s Center for</p>
          <p>Research on Foundation Models [33].
• METRICS – An international competition for the</p>
          <p>evaluation of robotics and AI[36].
• NIST’s 2021 AI Measurement and Evaluation</p>
          <p>Workshop[37].</p>
        </sec>
        <sec id="sec-info-benefits">
          <title>3.2.2. Benefits</title>
          <p>While designing and implementing a program that meets the requirements we have identified would be a challenge, the benefits of meeting that challenge are significant and tangible. By filling the evidence gap, the program would help to foster a public that was better informed about the real capabilities, limitations, and risks of AI-enabled systems (including those drawing upon LLMs). It would do so both directly, insofar as its results were consumed by members of the public without the mediation of other entities, and indirectly, insofar as its results reached the public through the mediation of civil society groups or educational initiatives focused on questions of society and technology. A better-informed public would, in turn, be one better positioned to recognize, and address, risks to core human values, to protect the liberty, privacy, and dignity of the individual, to resist the temptation of unwarranted hopes or fears about AI, and to support measures that further, in a responsible manner, scientific innovation and economic prosperity.</p>
          <p>Apart from these primary benefits, such a program would also bring a number of collateral benefits. These include: (1) thanks to its provision of empirically sound and readily understandable evaluations of effectiveness, providing policymakers and regulators with the basis for evidence-based decision making; (2) thanks to its meeting the interoperability requirement, fostering international cooperation; and (3) thanks to its addressing the challenges of defining and obtaining metrics for complex concepts and goals, advancing consensus around metrics and evaluation design.</p>
        </sec>
        <sec id="sec-1-4-2">
          <title>3.2.3. Action to date</title>
          <p>In recognition of both the benefits and the challenges of developing a benchmarking program that would meet the requirements we have identified, preliminary work has begun on the design and implementation of such a program. More specifically, under the auspices of the IEEE and The Future Society, a working group has been formed to explore the advisability and feasibility of pursuing such a project. The group includes representation from key agencies on both sides of the Atlantic. To date, the group has reached agreement on the need and the outlines of a program that would meet the need. Its current focus is on exploring practical questions related to how such a program should be developed. The group has not yet set a timetable for reporting on the results of its exploratory work.</p>
        </sec>
      </sec>
    </sec>
    <sec id="sec-2">
      <title>4. Strengthening the AI operating environment through better tools</title>
      <p>In the previous section, we considered a proposal aimed
at strengthening the human environment in which AI
is deployed through the fostering of a better-informed
public. More specifcally, the proposal seeks to create a
better-informed public through the establishment of an
institutionalized program of open and interoperable AI
benchmarking evaluations which have been designed to
gather and publish sound evidence regarding the
capabilities and limitations of AI-enabled systems when applied
in real-world circumstances.</p>
      <p>The evidence generated by benchmarking evaluations
is a key input to a sound assessment of the
trustworthiness of a technology. A well-designed benchmark (one
accurately modeling real-world conditions, using data
sets representative of those likely to be encountered in
the actual application of a technology, and quantifying
the various aspects of effectiveness through meaningful
metrics) can tell us what we can reasonably expect (in
terms of both capabilities and limitations) from a given
technology in a given circumstance. That expectation can
then be used to decide whether we have a plausible basis
for trusting the technology to perform the task we are
asking of it. The evidence generated by a benchmarking
evaluation cannot, however, tell us whether the
technology in question, once it has been applied, has in fact met
its objectives in the specific circumstance in which we
have applied it. If we want that information, we need to
turn to local validation.</p>
      <p>The results generated by a local validation exercise
(a real-time or after-the-fact test of the effectiveness
achieved by a given technology in a specific
circumstance) are complementary to those generated by
benchmarking evaluations. The latter tell us whether we have
empirical grounds for believing that a technology of a
given class will be successful in circumstances broadly
similar to those modeled in the benchmark; the former
tell us whether we have empirical grounds for
believing that a specific instance of a technological system
was successful in the specific circumstances in which we
did apply it (specific data, specific hardware conditions,
specific operators, specific timetables, and so on). Both
questions are relevant in assessing the trustworthiness of
a technology. The general question (answered by
benchmarking evaluations) is most relevant before application, those same competencies must be much more widely
when we are deciding whether to adopt the technology distributed. Individuals at a geographically very broad
for a given task. The specific question (answered by local range of sites of AI deployment will need to be supplied
validation) is most relevant after (or during) application, with the competencies required to run meaningful tests
when we are deciding whether to trust the results that of the technology as it has been deployed at their sites
have actually been generated by the technology. Having and in their specific circumstances. Meeting this need
reliable answers to both questions is essential to putting does not mean that every member of the public has to
the adoption and use of advanced technologies in the be equipped with the competencies required to design
service of the law on an empirically sound footing. and run evaluations; it will sufice to widen the circle of</p>
      <p>The complementary relationship between the two types of inquiry can be illustrated with an example taken from legal discovery in the US. The evaluations conducted in the TREC Legal Track (2006-2011) [35] produced results that showed that advanced retrieval technologies (often termed “technology-assisted review” or “TAR”) could be reasonably effective at performing the task of retrieving documents responsive to a request for production.⁶ That evidence gave responding parties the empirical basis they needed to adopt some variety of that class of technologies as the means to meet their discovery obligations (and, importantly, gave courts the empirical basis they needed to license that adoption). That evidence did not, however, obviate the need for local validation of the results generated by a given technology in a given matter. Requesting parties, and courts, still expect the circumstance-specific, after-the-fact, results that come only from local validation (and these expectations are often encoded in ESI (“electronically stored information”) protocols which govern discovery procedures in a given matter). The general (TREC) evaluations provided the plausibility that gave the green light for adoption, but the matter-specific (local) evaluations are still needed to provide the evidence that establishes the soundness of the actual results.</p>
      <p>⁶ The choice of modal is important here: the studies showed that TAR could achieve reasonably high levels of recall and precision; they did not show that TAR would, in all its instantiations and in all circumstances, achieve those results. Hence the need for local validation. This point is also sometimes insufficiently appreciated by readers of [39], which presented an analysis showing that TAR can be superior to manual review (not that it will be superior in all circumstances).</p>
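      <p>To make concrete what such a local validation exercise might involve, the following sketch estimates the recall achieved by a completed review from a simple random sample of judged documents, with a Wilson score interval to convey the uncertainty of the estimate. It is our own illustration under simplifying assumptions (simple random sampling, reliable relevance judgments); it is not a procedure prescribed by TREC or by any ESI protocol.</p>
      <preformat>
# Sketch of a simple local validation step for a completed document review:
# estimate recall from a random sample of judged documents, with a Wilson
# score interval. Illustrative only; actual protocols specify their own
# sampling and measurement procedures.
import math

def wilson_interval(successes, n, z=1.96):
    # Wilson score interval for a binomial proportion (95% by default).
    if n == 0:
        return 0.0, 0.0
    p = successes / n
    denom = 1 + z * z / n
    center = (p + z * z / (2 * n)) / denom
    half = (z / denom) * math.sqrt(p * (1 - p) / n + z * z / (4 * n * n))
    return max(0.0, center - half), min(1.0, center + half)

# Suppose a random sample of the collection was reviewed by subject-matter
# experts: 120 sampled documents were judged relevant, of which the
# technology-assisted review had retrieved 96.
relevant_in_sample = 120
retrieved_and_relevant = 96

recall_estimate = retrieved_and_relevant / relevant_in_sample
low, high = wilson_interval(retrieved_and_relevant, relevant_in_sample)
print(f"Recall estimate: {recall_estimate:.2f}")
print(f"95% Wilson interval: {low:.2f} to {high:.2f}")
      </preformat>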
      <sec id="sec-2-1">
        <title>If we wish to provide domain experts and operators with</title>
        <p>the resources needed to conduct local testing of the
systems they are overseeing, the resources we make
available to them must meet a number of requirements. Chief
among these are the following.
4.1. The need
If local validation is an important element in an
assessment of the trustworthiness of a technology, then there
is a need to bring about the conditions needed to ensure
that sound local validation exercises can be conducted
often and everywhere. Here, however, there is a
challenge. Whereas, in the case of benchmarking evaluations,
the competencies required to design and run
meaningful and statistically sound tests of the efectiveness of
a technology can reside in a relatively small number of
individuals (the individuals organizing and running the
benchmarking program), in the case of local validations,</p>
      </sec>
      <sec id="sec-2-2">
        <title>6The choice of modal is important here: the studies showed that TAR</title>
        <p>could achieve reasonably high levels of recall and precision; they did
not show that TAR would, in all its instantiations and in all
circumstances, achieve those results. Hence the need for local validation.
This point is also sometimes insuficiently appreciated by readers
of [39], which analyzed that showed that TAR can be superior to
manual review (not that it will be superior in all circumstances).
• Application-specific. The testing that is
required will vary from application to application.
What is required for the local validation of an
instance of TAR applied to the task of
discovery, for example, will difer from that which is
required for the local validation of a risk-assessment
technology applied to custody decisions. The
resources must therefore be application-specific
and the ultimate goal should be the creation of
a “library” of resources, each of which is tailored
(in terms of test design, metrics, sampling
procedures, interpretive guidance, and so on) to a
specific task to which an AI-enabled system may
be applied.</p>
      </sec>
      <sec id="sec-2-3">
        <title>7Of course, there will not be a need for local validation for every</title>
        <p>deployment of AI, but even restricting to deployments of sensitive
applications, and even allowing for some level of aggregate testing
of deployed technologies, there will still be a need for achieving
a much wider distribution of the required competencies than we
have today.</p>
      </sec>
      <sec id="sec-2-4">
        <title>The creation of a repository of resources like that pro</title>
        <p>posed in this section is no small undertaking; realizing
the vision will require input from experts from a wide
range of disciplines and subject-matter areas. The
beneifts of such a repository, however, would be considerable.</p>
        <p>These include:
• Tutorial and procedural content. The re- 4.2.2. Benefits
sources should provide not only a procedural
“recipe” for conducting a test, but should also
provide suficient tutorial content to enable an
operator to understand the motivation behind a
given procedural step (what a given term-of-art
means, why a given metric is being used, why a
given sampling design is chosen, and so on). To
be efective, these resources should be calibrated
for users with intermediate levels of expertise in
the use and testing of advanced legal
technologies. They need not be at the level of academic
research papers, but they do have to go beyond
elementary introductions.
• Intended audience. The resources should be
carefully calibrated to the level of expertise of
their intended audience. Those who will be
responsible for conducting local validation
exercise will be a smaller, and technically more
advanced, group than those consuming the results
of those evaluations. The resources be calibrated
to meet the requirements of these more expert
users (while, to the extent possible remaining
within the grasp, at least at a high level, of
nonexpert users).
• Adaptable. Even with the boundaries of a
specific domain and task, there will be considerable 4.2.3. Action to date
circumstance-specific variation from one
deployment of a system to another. The guidance pro- The repository we have proposed remains, at the moment,
vided by the resources should be of a suficient aspirational; there is as yet no program under way to
depth to enable an operator to adapt the specified create it. Work has begun, however, on creating materials
procedures for use in the specific circumstances that would meet the requirements specified for resources
at hand. in the repository and that could serve as a model for other
resources.</p>
        <p>More specifically, under the auspices of the IEEE and
The Future Society, a project has been initiated, and in
fact is nearing completion, to create a set of resources
that, in the specific domain of legal discovery, will enable
practitioners to conduct meaningful local validation of
the results of applying advanced review technologies (or,
for that matter, to the results of applying any review
technology) to the task of legal discovery. The specific
materials we have drafted are the following.</p>
        <p>• Improved competence;
• Improved efectiveness;
• Strengthened trust;
• Improved risk containment;
• More broadly distributed agency.
• Direction to other resources. As a practical
matter, the resources cannot cover every
circumstance likely to be encountered in the real-world.</p>
        <p>While they should be of suficient depth to cover
the most common circumstances, they should
provide direction to additional resources (including
human resources) to consult when less typical
circumstances are encountered.</p>
      </sec>
      <sec id="sec-2-5">
        <title>What we have listed above are general requirements</title>
        <p>that any resource must meet if it is to serve the purpose of
distributing the competencies needed to enable more
frequent and efective local validation of AI-enabled systems.
What we have not specified, however is any particular
format for the resources. That is by design. There are, in
fact, a range of diferent formats such resources might
take (written procedures, glossaries, handbooks, video
tutorials online calculators, and so on), and which format
will be most efective will vary from one domain (and
audience) to the next. We therefore leave the specific
format as a question to be decided at the implementation
stage.
• A Model Protocol. An adaptable model ESI
protocol that addresses the key issues that currently
trouble parties in the discovery phase of
litigation. The Protocol focuses on gathering the
evidence needed to have an informed trust in the
results of a review; its provisions are shaped by
the principles of proportionality and
evidencebased decision-making.
• A Commentary. A line-by-line commentary on
the Protocol. The Commentary is designed to
provide justification, interpretive guidance, and
tutorial background for the Protocol’s provisions.
• A Handbook for Practitioners. A companion
document that provides an expanded discussion
of the sampling and measurement procedures
specified in the Protocol. The Handbook is
intended to serve as a resource for advanced
practitioners (and other stakeholders) seeking a deeper
and more detailed understanding of the required
statistical procedures.
more nuanced and domain-specific approaches
to risk containment; and
• By distributing knowledge more broadly
(whether that distribution is direct or mediated
by other entities or initiatives) advance the
empowerment of the individual (against both
private and state actors).</p>
          <p>These materials have been drafted and are currently being reviewed by a group of experts with a range of different perspectives on the use of advanced technologies for legal discovery and on how to put that use on the basis of an informed trust. We plan to publish the materials in 2023. Our hope is that the materials will both serve their immediate purpose of putting the use and testing of e-discovery technologies on a sounder footing and serve the larger purpose of providing a model for resources that will enable the wider distribution of the competencies needed to conduct local validation of AI-enabled systems in other domains.</p>
        </sec>
      </sec>
    </sec>
    <sec id="sec-3">
      <title>5. Concluding remarks</title>
      <p>In this paper, we have drawn attention to the environment in which AI-enabled systems are deployed as a key element in any strategy for containing the risks (and for realizing the potential) attendant on the use of such systems. We have focused, more specifically, on the human component of the environment and considered two approaches (generating better information about AI’s real capabilities and limitations, creating tools that will enable practitioners to conduct local validation of the results of AI-enabled applications) for strengthening that component against risk. There are, of course, other aspects of the environment in which AI-enabled systems are deployed (hardware, software, even legal and financial), and exploration of ways to strengthen those components (making them more conducive to the detection, reporting, and resolution of risks) could pay off in more effective or efficient approaches to risk containment. One practical example is creating readily accessible pathways and repositories that would allow users (especially better-informed and better-equipped users) to report anomalies they have observed and to compare their observations with those submitted by others [4].</p>
      <p>As can be seen by reviewing the requirements specified for the two proposals we have considered, the work required to strengthen the environment against AI-associated risk is non-trivial. To be successful, approaches on the environment-focused track require a considerable amount of planning, coordination, and effort. The benefits these approaches bring, however, are significant. Environment-focused approaches may:</p>
      <p>• By distributing more broadly the means for identifying and responding to unwanted outcomes from AI-enabled applications, avoid some of the adverse effects on innovation and technological development that may be occasioned by top-down approaches;</p>
      <p>• By allowing practitioners to tailor solutions to their particular objectives and conditions, enable more nuanced and domain-specific approaches to risk containment; and</p>
      <p>• By distributing knowledge more broadly (whether that distribution is direct or mediated by other entities or initiatives), advance the empowerment of the individual (against both private and state actors).</p>
      <p>Given these benefits, we think that policymakers, and other stakeholders engaged in advancing the responsible use of AI, should always maintain an environment-focused (or bottom-up) track as a complement to the application-focused (or top-down) track. In fact, given the more benign collateral implications of environment-focused approaches, they should often be viewed as the solution of first recourse.</p>
    </sec>
    <sec id="sec-ack">
      <title>Acknowledgments</title>
      <p>We would like to thank the organizers of the 3rd International Workshop on Artificial Intelligence and Intelligent Assistance for Legal Professionals in the Digital Workplace (LegalAIIA 2023) for providing a forum at which we could express our views and hear those of others interested in these topics.</p>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>
          [1]
          <string-name>
            <given-names>M.</given-names>
            <surname>Shanahan</surname>
          </string-name>
          ,
          <article-title>Talking about large language models</article-title>
          ,
          <source>arXiv preprint arXiv:2212.03551</source>
          (
          <year>2022</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>
          [2]
          <string-name>
            <given-names>E. M.</given-names>
            <surname>Bender</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Koller</surname>
          </string-name>
          ,
          <article-title>Climbing towards NLU: On meaning, form, and understanding in the age of data</article-title>
          , in:
          <source>Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics</source>
          , Online,
          <year>2020</year>
          , pp.
          <fpage>5185</fpage>
          -
          <lpage>5198</lpage>
          . URL: https://aclanthology.org/2020.acl-main.463. doi:10.18653/v1/2020.acl-main.463
        </mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>
          [3]
          <string-name>
            <given-names>OECD</given-names>
            ,
            <surname>Principles on</surname>
          </string-name>
          <string-name>
            <surname>AI</surname>
          </string-name>
          ,
          <year>2019</year>
          . URL: https://oecd.ai/ en/ai-principles.
        </mixed-citation>
      </ref>
      <ref id="ref4">
        <mixed-citation>
          [4] IEEE,
          <source>Ethically Aligned Design, Version</source>
          <volume>1</volume>
          ,
          <year>2019</year>
          . URL: https://standards.ieee.org/ industry-connections/ec/ead1e-infographic/.
        </mixed-citation>
      </ref>
      <ref id="ref5">
        <mixed-citation>
          [5] Council of Europe,
          <source>European Ethical Charter on the Use of Artificial Intelligence in Judicial Systems and their Environment</source>
          ,
          <year>2018</year>
          . URL: https://rm.coe.
          <article-title>int/ ethical-charter-en-for-</article-title>
          <string-name>
            <surname>publication-</surname>
          </string-name>
          4
          <string-name>
            <surname>-</surname>
          </string-name>
          december-2018
          <source>/ 16808f699c.</source>
        </mixed-citation>
      </ref>
      <ref id="ref6">
        <mixed-citation>
          [6]
          <string-name>
            <given-names>European</given-names>
            <surname>Commission</surname>
          </string-name>
          ,
          <source>Ethics Guidelines for Trustworthy AI</source>
          ,
          <year>2019</year>
          . URL: https://digital-strategy.ec.europa.eu/en/library/ ethics
          <article-title>-guidelines-trustworthy-ai.</article-title>
        </mixed-citation>
      </ref>
      <ref id="ref7">
        <mixed-citation>
          [7] Future of Life,
          <source>Asilomar AI Principles</source>
          ,
          <year>2017</year>
          . URL: https://futureoflife.org/
          <year>2017</year>
          /08/11/ai-principles/.
        </mixed-citation>
      </ref>
      <ref id="ref8">
        <mixed-citation>[8] Partnership on AI, PAI Tenets, 2016. URL: https://partnershiponai.org/about/#tenets.</mixed-citation>
      </ref>
      <ref id="ref9">
        <mixed-citation>[9] J. Fjeld, N. Achten, H. Hilligoss, A. Nagy, M. Srikumar, Principled artificial intelligence: Mapping consensus in ethical and rights-based approaches to principles for AI, Berkman Klein Center Research Publication (2020).</mixed-citation>
      </ref>
      <ref id="ref10">
        <mixed-citation>[10] T. Hagendorff, The ethics of AI ethics: An evaluation of guidelines, Minds and Machines 30 (2020) 99–120.</mixed-citation>
      </ref>
      <ref id="ref11">
        <mixed-citation>[11] Y. Zeng, E. Lu, C. Huangfu, Linking artificial intelligence principles, 2018. arXiv:1812.04814.</mixed-citation>
      </ref>
      <ref id="ref12">
        <mixed-citation>[12] L. Weidinger, J. Uesato, M. Rauh, C. Griffin, P.-S. Huang, J. Mellor, A. Glaese, M. Cheng, B. Balle, A. Kasirzadeh, et al., Taxonomy of risks posed by language models, in: 2022 ACM Conference on Fairness, Accountability, and Transparency, 2022, pp. 214–229.</mixed-citation>
      </ref>
      <ref id="ref13">
        <mixed-citation>[13] Y. Bai, S. Kadavath, S. Kundu, A. Askell, J. Kernion, A. Jones, A. Chen, A. Goldie, A. Mirhoseini, C. McKinnon, et al., Constitutional AI: Harmlessness from AI feedback, arXiv preprint arXiv:2212.08073 (2022).</mixed-citation>
      </ref>
      <ref id="ref14">
        <mixed-citation>[14] J. Mökander, J. Schuett, H. R. Kirk, L. Floridi, Auditing large language models: a three-layered approach, AI and Ethics (2023) 1–31.</mixed-citation>
      </ref>
      <ref id="ref15">
        <mixed-citation>[15] D. Long, B. Magerko, What is AI literacy? Competencies and design considerations, in: Proceedings of the 2020 CHI Conference on Human Factors in Computing Systems, 2020, pp. 1–16.</mixed-citation>
      </ref>
      <ref id="ref16">
        <mixed-citation>[16] P. Lin, J. Van Brummelen, Engaging teachers to co-design integrated AI curriculum for K-12 classrooms, in: Proceedings of the 2021 CHI Conference on Human Factors in Computing Systems, 2021, pp. 1–12.</mixed-citation>
      </ref>
      <ref id="ref17">
        <mixed-citation>[17] D. Gašević, G. Siemens, S. Sadiq, Empowering learners for the age of artificial intelligence, Computers and Education: Artificial Intelligence (2023) 100130.</mixed-citation>
      </ref>
      <ref id="ref18">
        <mixed-citation>[18] V. Lepercq, Introducing Education, 2022. URL: https://huggingface.co/blog/education.</mixed-citation>
      </ref>
      <ref id="ref19">
        <mixed-citation>[19] Hugging Face, Introducing The World’s Largest Open Multilingual Language Model: BLOOM, 2023. URL: https://bigscience.huggingface.co/blog/bloom.</mixed-citation>
      </ref>
      <ref id="ref20">
        <mixed-citation>[20] The Athens Roundtable, The Athens Roundtable: Artificial Intelligence and the Rule of Law, 2019. URL: https://www.aiathens.org/dialogue/first-edition.</mixed-citation>
      </ref>
      <ref id="ref21">
        <mixed-citation>[21] The Future Society, MOOC on AI and the Rule of Law, 2022. URL: https://thefuturesociety.org/2022/05/12/mooc-on-ai-and-the-rule-of-law-successful-completion-of-the-pilot-phase/.</mixed-citation>
      </ref>
      <ref id="ref22">
        <mixed-citation>[22] aiEDU, aiEDU: The AI Education Project, 2023. URL: https://www.aiedu.org.</mixed-citation>
      </ref>
      <ref id="ref23">
        <mixed-citation>[23] C. McLeod, Trust, in: E. N. Zalta (Ed.), Stanford Encyclopedia of Philosophy, Metaphysics Research Lab, Stanford University (2006).</mixed-citation>
      </ref>
      <ref id="ref24">
        <mixed-citation>[24] IJCAI, Artificial Intelligence, 2023. URL: https://www.sciencedirect.com/journal/artificial-intelligence.</mixed-citation>
      </ref>
      <ref id="ref25">
        <mixed-citation>[25] AAAI, Association for the Advancement of Artificial Intelligence, 2023. URL: https://www.aaai.org/.</mixed-citation>
      </ref>
      <ref id="ref26">
        <mixed-citation>[26] IAAIL, International Conference on Artificial Intelligence and Law (ICAIL), 2023. URL: http://www.iaail.org/.</mixed-citation>
      </ref>
      <ref id="ref27">
        <mixed-citation>[27] NIST, Text REtrieval Conference (TREC), 2023. URL: https://trec.nist.gov/.</mixed-citation>
      </ref>
      <ref id="ref28">
        <mixed-citation>[28] Ministère de la Justice, Communiqué du Ministère de la Justice et de la première présidence de la cour d’appel de Rennes, 2017.</mixed-citation>
      </ref>
      <ref id="ref29">
        <mixed-citation>[29] World Economic Forum, Responsible Limits on Facial Recognition; Use Case: Flow Management; Part II: Pilot phase: Self-assessment, the audit management system and certification, 2020. URL: https://www3.weforum.org/docs/WEF_Responsible_Limits_on_Facial_Recognition_2020.pdf.</mixed-citation>
      </ref>
      <ref id="ref30">
        <mixed-citation>[30] J. Snow, Amazon’s Face Recognition Falsely Matched 28 Members of Congress with Mugshots, 2018. URL: https://www.aclu.org/blog/privacy-technology/surveillance-technologies/amazons-face-recognition-falsely-matched-28.</mixed-citation>
      </ref>
      <ref id="ref31">
        <mixed-citation>[31] C. Garvie, A. Bedoya, J. Frankle, The perpetual line-up, Georgetown Law Center on Privacy &amp; Technology 18 (2016).</mixed-citation>
      </ref>
      <ref id="ref32">
        <mixed-citation>[32] NIST, Face Recognition Vendor Test (FRVT), 2023. URL: https://www.nist.gov/programs-projects/face-recognition-vendor-test-frvt-ongoing.</mixed-citation>
      </ref>
      <ref id="ref33">
        <mixed-citation>[33] Center for Research on Foundation Models, Holistic Evaluation of Language Models (HELM), 2023. URL: https://crfm.stanford.edu/helm/latest/.</mixed-citation>
      </ref>
      <ref id="ref34">
        <mixed-citation>[34] NIST, AI Measurement and Evaluation Panel on Measuring Concepts that are Complex, Contextual, and Abstract, 2021. URL: https://www.nist.gov/news-events/events/2021/06/ai-measurement-and-evaluation-workshop.</mixed-citation>
      </ref>
      <ref id="ref35">
        <mixed-citation>[35] TREC, TREC Legal Track, 2011. URL: https://trec-legal.umiacs.umd.edu/.</mixed-citation>
      </ref>
      <ref id="ref36">
        <mixed-citation>[36] LNE, METRICS: An international competition for the evaluation of robotics and AI, 2023. URL: https://metricsproject.eu.</mixed-citation>
      </ref>
      <ref id="ref37">
        <mixed-citation>[37] NIST, AI Measurement and Evaluation, 2021. URL: https://www.nist.gov/news-events/events/2021/06/ai-measurement-and-evaluation-workshop.</mixed-citation>
      </ref>
      <ref id="ref38">
        <mixed-citation>[38] S. Hallensleben, C. Hustedt, From principles to practice: An interdisciplinary framework to operationalise AI ethics, Bertelsmann Stiftung, 2020.</mixed-citation>
      </ref>
      <ref id="ref39">
        <mixed-citation>[39] M. R. Grossman, G. V. Cormack, Technology-assisted review in e-discovery can be more effective and more efficient than exhaustive manual review, Rich. JL &amp; Tech. 17 (2010) 1.</mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>