<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <article-meta>
      <title-group>
        <article-title>Ethical AI Systems and Shared Accountability: The Role of Economic Incentives in Fairness and Explainability</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Dae-Hyun Yoo</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Caterina Giannetti</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <aff id="aff0">
          <label>0</label>
          <institution>Department of Economics and Management, University of Pisa</institution>
          ,
          <addr-line>Pisa PI 56124</addr-line>
          ,
          <country country="IT">Italy</country>
        </aff>
      </contrib-group>
      <pub-date>
<year>2024</year>
      </pub-date>
      <volume>1</volume>
      <abstract>
        <p>This paper presents a principal-agent model for aligning artificial intelligence (AI) behaviors with human ethical objectives. In this framework, the end-user acts as the principal, offering a contract to the system developer (the agent) that specifies desired levels of ethical alignment for the AI system. The developer can exercise varying levels of effort to achieve this alignment, with higher levels - such as those required in Constitutional AI - demanding more effort and posing greater challenges. To incentivize the developer to invest more effort in aligning AI with higher ethical principles, appropriate compensation is necessary. When ethical alignment is unobservable and the developer is risk-neutral, the optimal contract achieves the same alignment and expected utilities as when it is observable. For observable alignment, a fixed reward is uniquely optimal for strictly risk-averse developers, while for risk-neutral developers, it remains one of several optimal solutions. This simple model demonstrates that balancing responsibility between users and developers is crucial for fostering ethical AI. Users seeking higher ethical alignment must not only compensate developers adequately but also adhere to design specifications and regulations to ensure the system's ethical integrity.</p>
      </abstract>
      <kwd-group>
        <kwd>Ethical Alignment</kwd>
        <kwd>Asymmetric Information</kwd>
        <kwd>Principal-Agent Model</kwd>
        <kwd>Responsibility Allocation</kwd>
        <kwd>Constitutional AI</kwd>
      </kwd-group>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>1. Introduction</title>
      <p>
        As artificial intelligence (AI) systems become increasingly integrated into society and tasked with
making complex decisions on behalf of humans, ensuring the ethical alignment between AI behavior
and human values is essential for fostering trust and collaboration [
        <xref ref-type="bibr" rid="ref2">2</xref>
        ]. However, the ethical alignment
problem in AI is complicated by the involvement of multiple entities—developers, deployers, and
users—each of whom may have different objectives, incentives, and levels of information [
        <xref ref-type="bibr" rid="ref3">3</xref>
        ]. This
misalignment can lead to conflicts, especially when users delegate the ethical design of AI systems to
developers, who often possess more information but may not fully share the users’ ethical goals. The
ethical alignment challenge in AI systems mirrors the principal-agent problem commonly observed in
economics, where discrepancies arise between the interests of a principal (e.g., the user) and an agent
(e.g., the system developer). In AI, such misalignment can occur due to incomplete information, reward
misspecification, or differences in values [
        <xref ref-type="bibr" rid="ref2 ref4 ref5">2, 4, 5</xref>
        ].
      </p>
      <p>
        As AI systems, such as autonomous vehicles and large language models, take on more decision-making
authority, addressing misalignment with human ethical standards becomes critical. Developers
have the flexibility to exert varying levels of effort when integrating ethical objectives into an AI
system. One approach, known as Constitutional AI, involves a training process in which a language
model is guided by a set of ethical principles, referred to as a "constitution" [
        <xref ref-type="bibr" rid="ref6">6, 7</xref>
        ]. These principles
are systematically instilled in the model throughout its development, shaping its behavior to align
with ethical guidelines. This approach ensures that the AI makes decisions and provides outputs that
reflect these predefined standards, creating a framework for responsible and transparent AI operation.
However, achieving a higher degree of ethical alignment requires significantly more effort, expertise,
and resources from the developer. These increased costs stem from the complexity of embedding stronger
ethical principles, ensuring compliance with evolving guidelines, and addressing unforeseen dilemmas
in AI decision-making. The greater the desired ethical rigor, the more challenging and resource-intensive
the development process becomes, both in terms of technical implementation and ongoing oversight.
      </p>
      <p>To investigate the various possibilities for aligning a system, this paper adapts a basic principal-agent
model from economics [8] to explore how responsibility for ethical AI systems can be distributed
among different stakeholders through economic incentives. By focusing on the contractual relationship
between users and system developers, we analyze optimal reward schemes that incentivize developers
to align AI behaviors with human ethical objectives. This model contributes to the growing discussion
on how to allocate responsibility for ethical AI and offers insights into how economic mechanisms can
be used to mitigate ethical risks in AI deployment.</p>
    </sec>
    <sec id="sec-2">
      <title>2. Principal-Agent Model</title>
      <sec id="sec-2-1">
        <title>2.1. Assumptions</title>
        <p>The variable π represents the observable benefits that arise from deploying ethically aligned AI systems. While influenced by the level of ethical alignment (e), π is not entirely determined by it and takes values within [π̲, π̄]. The relationship between π and e is characterized by a conditional density function f(π | e), where f(π | e) &gt; 0 for all e ∈ E and π ∈ [π̲, π̄]. This introduces uncertainty, as any realization of π can occur for a given level of ethical alignment.</p>
        <p>The level of ethical alignment e, chosen by the system developer, represents the effort made to align AI systems with ethical objectives. The set E encompasses all available ethical alignment levels, with two primary options:</p>
        <p>e₁: high ethical alignment; e₂: low ethical alignment.</p>
        <p>We assume that the effort level e₁, which corresponds to higher ethical alignment, yields greater benefits for the user (principal) but imposes greater challenges on the system developer (agent). These challenges arise due to the increased complexity and resource demands of implementing stronger ethical guidelines, as well as ensuring compliance and addressing unforeseen dilemmas in the AI’s decision-making process.</p>
        <p>This creates a conflict of interests between the user and the developer. More specifically, the distribution of π conditional on e₁ first-order stochastically dominates that of e₂: the distribution functions satisfy F(π | e₁) ≤ F(π | e₂) at all π ∈ [π̲, π̄], with strict inequality on some interval. This implies that the expected benefits from e₁ exceed those from e₂:</p>
        <p>∫ π f(π | e₁) dπ &gt; ∫ π f(π | e₂) dπ</p>
        <p>The system developer is an expected utility maximizer with a Bernoulli utility function u(w, e) over reward (w) and ethical alignment level (e), satisfying:</p>
        <p>∂u/∂w(w, e) &gt; 0 and ∂²u/∂w²(w, e) ≤ 0 for all (w, e) (1)</p>
        <p>u(w, e₁) &lt; u(w, e₂) for all w (2)</p>
        <p>Thus, the developer prefers higher rewards but dislikes a high level of ethical alignment. The choice of e₁ provides greater benefits to the user but imposes more "disutility" on the developer compared to e₂. We focus on a specific utility function commonly used in the literature [8]:</p>
        <p>u(w, e) = v(w) − g(e), where v′(w) &gt; 0, v′′(w) ≤ 0, and g(e₁) &gt; g(e₂) (3)</p>
        <p>The user, assumed to be risk-neutral, seeks to maximize expected returns, receiving the benefits of
ethical alignment minus the rewards paid to the system developer.</p>
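        <p>To make these assumptions concrete, the following minimal Python sketch instantiates one admissible parameterization and verifies the stochastic-dominance ordering numerically. The support [0, 10], the linear densities, the square-root utility, and the effort costs g(e₁) = 1.0 and g(e₂) = 0.4 are our own illustrative choices, not values prescribed by the model.</p>
        <preformat>
import numpy as np

pi = np.linspace(0.0, 10.0, 1001)   # benefit support [pi_low, pi_high] = [0, 10]
dpi = pi[1] - pi[0]

# Conditional densities: f(pi | e1) tilts mass toward high benefits, f(pi | e2) toward low.
f_e1 = pi / 50.0
f_e2 = (10.0 - pi) / 50.0

# Analytic CDFs of the two densities above.
F_e1 = pi**2 / 100.0
F_e2 = pi * (20.0 - pi) / 100.0

# First-order stochastic dominance: F(pi | e1) never exceeds F(pi | e2).
assert np.all(F_e2 - F_e1 >= -1e-12)

# Hence expected benefits are strictly higher under high alignment e1.
E_pi_e1 = np.sum(pi * f_e1) * dpi   # approx. 6.67
E_pi_e2 = np.sum(pi * f_e2) * dpi   # approx. 3.33
assert E_pi_e1 > E_pi_e2

# Developer utility u(w, e) = v(w) - g(e): v concave (risk-averse), g(e1) above g(e2).
v = np.sqrt
g = {"e1": 1.0, "e2": 0.4}
        </preformat>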
      </sec>
      <sec id="sec-2-2">
        <title>2.2. The Optimal Contract with Observable Ethical Alignment Level</title>
        <p>Suppose the user offers a contract specifying the ethical alignment level e ∈ {e₁, e₂} and the system developer’s reward as a function of observed benefits, w(π). The system developer must receive an expected utility at least equal to ū, the reservation utility, if they accept the contract. If they reject it, they receive zero. The developer is assumed to find it worthwhile to align the AI system to the ethical objectives set by the contract.</p>
        <p>The user’s objective is to choose the optimal contract to maximize their expected benefits:</p>
        <p>Max over e ∈ {e₁, e₂} and w(π) of: ∫ (π − w(π)) f(π | e) dπ (4)</p>
        <p>subject to the participation constraint:</p>
        <p>∫ v(w(π)) f(π | e) dπ − g(e) ≥ ū (5)</p>
        <p>The constraint always binds at the solution: if it were slack, the user could lower the reward and the developer would still accept the contract. For a given alignment level e, choosing w(π) to minimize the user’s expected reward cost therefore reduces to: Min over w(π) of ∫ w(π) f(π | e) dπ, subject to constraint (5) holding with equality. (A.1)</p>
        <p>Let γ denote the multiplier on the constraint (5). The optimal reward scheme satisfies: −f(π | e) + γ · v′(w(π)) · f(π | e) = 0 (A.2), or equivalently: γ = 1 / v′(w(π)) for all π. (A.3)</p>
        <p>If the system developer is risk-averse (i.e., v′(w(π)) is decreasing), condition (A.3) can hold only if w(π) is a fixed amount, reflecting a risk-sharing result: the risk-neutral user insures the risk-averse developer by offering a fixed reward w* that satisfies: v(w*) − g(e) = ū. (A.4) Since g(e₁) &gt; g(e₂), it follows that w*₁ &gt; w*₂, meaning higher ethical alignment results in a higher reward. When the developer is risk-neutral (i.e., v(w) = w), a fixed reward is just one of many optimal schemes, provided the expected reward is ū + g(e).</p>
        <p>To determine the optimal e, the user selects the ethical alignment level e ∈ {e₁, e₂} that maximizes: ∫ π f(π | e) dπ − v⁻¹(ū + g(e)). (A.5) The first term represents the gross benefit from the ethically aligned AI system, while the second term represents the reward paid to the developer for the alignment effort. Whether e₁ or e₂ is optimal depends on the trade-off between the incremental benefits of e₁ over e₂ and the disutility imposed on the developer.</p>
        <p>Specifically, if the additional benefits from a higher level of ethical alignment under e₁ outweigh the increased cost and effort required from the developer, then e₁ becomes optimal: by (A.5), this is the case when ∫ π f(π | e₁) dπ − ∫ π f(π | e₂) dπ ≥ v⁻¹(ū + g(e₁)) − v⁻¹(ū + g(e₂)). However, if the marginal benefit is insufficient to cover the greater effort and resource demands of achieving a more stringent ethical standard, then e₂ may be the preferred choice. This balance reflects the fundamental tension between maximizing ethical outcomes and managing the practical limitations faced by developers, particularly in complex frameworks like Constitutional AI, where higher ethical alignment often requires significantly more effort, expertise, and oversight.</p>
        <p>Proposition 1. In the principal-agent model with observable ethical alignment, the optimal contract specifies the level of ethical alignment e* that maximizes the user’s net benefits. The system developer receives a fixed reward w* = v⁻¹(ū + g(e*)) if risk-averse. When the developer is risk-neutral, a fixed reward is one of many possible optimal reward schemes.</p>
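        <p>A minimal numerical sketch of Proposition 1 follows, reusing the illustrative numbers from the sketch above (square-root utility, so v⁻¹(u) = u²; all values are hypothetical rather than taken from the model). It computes the binding fixed rewards w*₁ and w*₂ and the user’s resulting choice of alignment level.</p>
        <preformat>
# Illustrative numbers only (not from the paper): expected benefits, effort costs,
# reservation utility, and the developer's concave utility v(w) = sqrt(w).
E_pi  = {"e1": 6.67, "e2": 3.33}    # E[pi | e], from the previous sketch
g     = {"e1": 1.0,  "e2": 0.4}     # disutility of alignment effort
u_bar = 1.0                          # reservation utility

def v_inv(u):
    """Inverse of v(w) = sqrt(w)."""
    return u ** 2

# Fixed reward making the participation constraint (A.4) bind: v(w*) = u_bar + g(e).
w_star = {e: v_inv(u_bar + g[e]) for e in E_pi}       # w*_1 = 4.00, w*_2 = 1.96

# The user then maximizes (A.5): expected benefit minus the fixed reward.
user_payoff = {e: E_pi[e] - w_star[e] for e in E_pi}
e_star = max(user_payoff, key=user_payoff.get)        # e1 for these numbers
print(w_star, user_payoff, e_star)
        </preformat>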
      </sec>
      <sec id="sec-2-3">
        <title>2.3. The Optimal Contract with Unobservable Ethical Alignment Level</title>
        <p>The optimal contract described in Proposition 1 achieves two objectives: it specifies an efficient level of ethical alignment and insures the system developer against reward risk. However, when the ethical alignment level e is not observable, these objectives conflict, as the developer’s pay must be tied to the uncertain benefits π to incentivize alignment. This generally leads to a welfare loss relative to full observability.</p>
        <p>Suppose the system developer is risk-neutral, so v(w) = w. Under full observability, the optimal alignment level e* solves:</p>
        <p>Max over e ∈ {e₁, e₂} of: ∫ π f(π | e) dπ − g(e) − ū (A.6)</p>
        <p>The user’s benefits are the value of expression (A.6), and the developer receives an expected utility of ū. When the developer’s effort is unobservable, Proposition 2 states that the user can still achieve the full-information payoff.</p>
        <p>Proposition 2. In the principal-agent model with unobservable ethical alignment and a risk-neutral system developer, an optimal contract results in the same ethical alignment level and expected utilities as under full observability.</p>
        <p>Proof. The user offers a contract w(π) = π − α, where α is a fixed payment (the “alignment price”). The developer chooses e to maximize his utility:</p>
        <p>Max over e ∈ {e₁, e₂} of: ∫ w(π) f(π | e) dπ − g(e) = ∫ π f(π | e) dπ − α − g(e) (A.7)</p>
        <p>Since (A.7) differs from the full-information objective (A.6) only by a constant, e* maximizes (A.7), and this contract induces the first-best alignment effort level e*.</p>
        <p>The developer accepts this contract if it provides at least ū in expected utility:</p>
        <p>∫ π f(π | e*) dπ − α − g(e*) ≥ ū (A.8)</p>
        <p>Let α* be the value of α at which (A.8) holds with equality. Rearranging:</p>
        <p>α* = ∫ π f(π | e*) dπ − g(e*) − ū (A.8.1)</p>
        <p>Thus, with w(π) = π − α*, both the user and the developer receive the same payoff as under full observability, with the user’s payoff being α*.</p>
        <p>The intuition behind Proposition 2 is straightforward. When the system developer is risk-neutral, the need for risk-sharing mechanisms is eliminated, allowing for more efficient incentives. In this case, the developer can be fully compensated based on the marginal returns of their effort in aligning the AI system with ethical principles, without incurring any risk-bearing losses. For example, in the context of Constitutional AI, this implies that a risk-neutral developer can focus solely on embedding ethical principles - such as those outlined in a "constitution" - without being deterred by the risks associated with uncertain outcomes. The user as a principal can therefore provide direct incentives to reward the developer’s effort in achieving higher levels of ethical alignment, leading to a more transparent and accountable AI system. Since the developer is indifferent to risk, the compensation structure can be fully aligned with the ethical objectives, enabling a smoother implementation of Constitutional AI without the need to factor in risk-related adjustments.</p>
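        <p>The proof of Proposition 2 can also be checked numerically. The sketch below, using the same hypothetical numbers as the previous sketches, computes the alignment price α* from (A.8.1) and verifies that, facing w(π) = π − α*, the developer chooses the first-best alignment level and is held exactly to the reservation utility ū.</p>
        <preformat>
# Illustrative numbers only, matching the previous sketches.
E_pi  = {"e1": 6.67, "e2": 3.33}    # E[pi | e]
g     = {"e1": 1.0,  "e2": 0.4}     # disutility of alignment effort
u_bar = 1.0                          # reservation utility

# First-best alignment level for a risk-neutral developer, from (A.6).
net  = {e: E_pi[e] - g[e] for e in E_pi}
e_fb = max(net, key=net.get)                          # e1 for these numbers

# "Alignment price" from (A.8.1): leaves the developer exactly u_bar.
alpha_star = E_pi[e_fb] - g[e_fb] - u_bar             # 4.67 here

# Facing w(pi) = pi - alpha*, the developer's expected utility from e is
# E[pi | e] - alpha* - g(e); the first-best level maximizes it by construction.
dev_utility = {e: E_pi[e] - alpha_star - g[e] for e in E_pi}
assert max(dev_utility, key=dev_utility.get) == e_fb  # incentive compatibility
assert round(dev_utility[e_fb] - u_bar, 9) == 0.0     # participation binds exactly
print(f"user payoff (alpha*) = {alpha_star:.2f}")     # full-information payoff
        </preformat>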
      </sec>
    </sec>
    <sec id="sec-3">
      <title>3. Results</title>
      <p>Our principal-agent model identifies the optimal reward scheme for system developers to align ethical
objectives under specific conditions. In the case where the ethical alignment level is unobservable
and the developer is risk-neutral, the optimal contract leads to the same ethical alignment choice and
expected utilities for both the developer and the user as if the ethical alignment level were observable.
When the ethical alignment level is observable, the optimal contract specifies a fixed reward for the
system developer. This is uniquely optimal if the developer is strictly risk-averse. However, if the
developer is risk-neutral, a fixed reward scheme is one of several possible optimal rewards. Furthermore,
if users desire high levels of ethical alignment in AI systems, they must offer greater compensation to
system developers, as higher ethical alignment comes with increased effort and costs for developers.
This trade-off is particularly relevant in practical scenarios where ethical considerations are paramount,
such as in Constitutional AI frameworks. In these cases, users play a key role in incentivizing developers
to achieve robust ethical standards by providing the necessary financial and contractual incentives.
Ultimately, the model highlights the importance of economic incentives in balancing responsibilities
between users and developers in the creation of ethical AI systems.</p>
    </sec>
    <sec id="sec-4">
      <title>4. Discussion &amp; Conclusion</title>
      <p>This research demonstrates that economic incentives play a crucial role in ensuring the ethical alignment
of AI systems through a reward scheme in a contract. Our findings emphasize that achieving higher
levels of ethical alignment, such as those seen in Constitutional AI, requires greater compensation
for system developers due to the increased effort, complexity, and resources involved. However, even
after developers align AI systems with ethical objectives, users share the responsibility of ensuring
these systems are deployed and utilized ethically. This includes adhering to the system’s design
specifications, regularly monitoring AI outputs and behavior to prevent deviations from ethical standards,
and complying with regulatory frameworks like the EU AI Act [9, 10]. If users identify unethical
outcomes - such as biased decisions - they must take corrective actions, whether by adjusting the
system’s parameters or collaborating with developers to address the issue. Ethical AI is a shared
responsibility, not solely resting on developers. Users must also maintain ongoing oversight to ensure
that AI continues to operate in alignment with ethical principles throughout its lifecycle, particularly
as it interacts with new environments and data.</p>
      <p>Our adaptation of the principal-agent model provides a theoretical framework that is both relevant
and applicable to current discussions on AI governance. By aligning economic incentives with ethical
outcomes, this model offers insights that can inform regulatory approaches, such as those proposed
in the EU AI Act, ensuring that system developers and users are both held accountable. This shared
responsibility can enhance compliance with ethical standards, particularly as the complexity of AI
systems increases.</p>
      <p>While our model offers valuable insights, it is important to acknowledge its limitations. For instance,
it assumes developers are fully rational and respond predictably to incentives, which may not always
hold true in practice. Additionally, the model does not account for other factors that could influence
ethical alignment, such as societal pressures or rapidly evolving technological landscapes.</p>
      <p>In conclusion, this research contributes to the ongoing dialogue about responsibility for ethical
AI and how it should be distributed between developers and users. By offering a concrete economic
model, it helps clarify how incentives can be structured to promote ethical AI while ensuring that all
stakeholders - developers, users, and regulators - actively maintain ethical standards. This is critical for
building trust and accountability as AI becomes increasingly integrated into society.</p>
      <p>Future research should explore more complex and dynamic incentive structures, including multiple
principals, as well as ways to incorporate factors such as societal pressures, evolving regulations, and
technological advancements like AI’s increasing autonomy and adaptability into the framework.</p>
    </sec>
    <sec id="sec-5">
      <title>Acknowledgments</title>
      <p>The authors thank Maria Bigoni and Nicola Meccheri for useful comments, and acknowledge support
from the project "Teaming-up with social artificial agents" funded by the Italian Ministry of Education,
University and Research under the Program for Research Projects of National Interest (PRIN), grant no.
2022ALBSWX.</p>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>
          [1]
          <string-name>
            <given-names>M.</given-names>
            <surname>Polignano</surname>
          </string-name>
          ,
          <string-name>
            <given-names>C.</given-names>
            <surname>Musto</surname>
          </string-name>
          ,
          <string-name>
            <given-names>R.</given-names>
            <surname>Pellungrini</surname>
          </string-name>
          , E. Purificato, G. Semeraro,
          <string-name>
            <given-names>M.</given-names>
            <surname>Setzu</surname>
          </string-name>
          ,
          <article-title>XAI.it 2024: An Overview on the Future of Explainable AI in the era of Large Language Models</article-title>
          ,
          <source>in: Proceedings of the 5th Italian Workshop on Explainable Artificial Intelligence, co-located with the 23rd International Conference of the Italian Association for Artificial Intelligence</source>
          , Bolzano, Italy, November 25-28,
          <year>2024</year>
          , CEUR.org,
          <year>2024</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>
          [2]
          <string-name>
            <given-names>D.</given-names>
            <surname>Hadfield-Menell</surname>
          </string-name>
          ,
          <string-name>
            <given-names>G. K.</given-names>
            <surname>Hadfield</surname>
          </string-name>
          ,
          <article-title>Incomplete contracting and ai alignment</article-title>
          ,
          <source>Proceedings of the 2019 AAAI/ACM Conference on AI, Ethics, and Society (AIES '19)</source>
          (
          <year>2019</year>
          )
          <fpage>417</fpage>
          -
          <lpage>422</lpage>
          . doi:10.1145/3306618.3314250.
        </mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>
          [3]
          <string-name>
            <given-names>Y.</given-names>
            <surname>Shavit</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Agarwal</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Brundage</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Adler</surname>
          </string-name>
          ,
          <string-name>
            <given-names>C.</given-names>
            <surname>O'Keefe</surname>
          </string-name>
          ,
          <string-name>
            <given-names>R.</given-names>
            <surname>Campbell</surname>
          </string-name>
          ,
          <string-name>
            <given-names>T.</given-names>
            <surname>Lee</surname>
          </string-name>
          ,
          <string-name>
            <given-names>P.</given-names>
            <surname>Mishkin</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Eloundou</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Hickey</surname>
          </string-name>
          ,
          <string-name>
            <given-names>K.</given-names>
            <surname>Slama</surname>
          </string-name>
          ,
          <string-name>
            <given-names>L.</given-names>
            <surname>Ahmad</surname>
          </string-name>
          ,
          <string-name>
            <given-names>P.</given-names>
            <surname>McMillan</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Beutel</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Passos</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D. G.</given-names>
            <surname>Robinson</surname>
          </string-name>
          ,
          <article-title>Practices for governing agentic ai systems</article-title>
          ,
          <year>2023</year>
          . URL: https://openai.com/index/practices-for-governing-agentic-ai-systems/, last accessed 2024/07/25.
        </mixed-citation>
      </ref>
      <ref id="ref4">
        <mixed-citation>
          [4]
          <string-name>
            <given-names>S.</given-names>
            <surname>Phelps</surname>
          </string-name>
          ,
          <string-name>
            <given-names>R. E.</given-names>
            <surname>Ranson</surname>
          </string-name>
          ,
          <article-title>Of models and tin men - a behavioural economics study of principal-agent problems in ai alignment using large-language models</article-title>
          ,
          <source>ArXiv abs/2307.11137</source>
          (
          <year>2023</year>
          ). doi:10.48550/arXiv.2307.11137.
        </mixed-citation>
      </ref>
      <ref id="ref5">
        <mixed-citation>
          [5]
          <string-name>
            <given-names>S.</given-names>
            <surname>Zhuang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D.</given-names>
            <surname>Hadfield-Menell</surname>
          </string-name>
          ,
          <article-title>Consequences of misaligned ai</article-title>
          ,
          <source>in: Thirty-Fourth International Conference on Neural Information Processing Systems (NeurIPS)</source>
          ,
          <year>2020</year>
          , pp.
          <fpage>1</fpage>
          -
          <lpage>14</lpage>
          . doi:10.48550/arXiv.2102.03896.
        </mixed-citation>
      </ref>
      <ref id="ref6">
        <mixed-citation>
          [6]
          <string-name>
            <given-names>Y.</given-names>
            <surname>Bai</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Kadavath</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Kundu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Askell</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Kernion</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Jones</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Chen</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Goldie</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Mirhoseini</surname>
          </string-name>
          ,
          <string-name>
            <given-names>C.</given-names>
            <surname>McKinnon</surname>
          </string-name>
          ,
          <string-name>
            <given-names>C.</given-names>
            <surname>Chen</surname>
          </string-name>
          ,
          <string-name>
            <given-names>C.</given-names>
            <surname>Olsson</surname>
          </string-name>
          ,
          <string-name>
            <given-names>C.</given-names>
            <surname>Olah</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D.</given-names>
            <surname>Hernandez</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D.</given-names>
            <surname>Drain</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D.</given-names>
            <surname>Ganguli</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D.</given-names>
            <surname>Li</surname>
          </string-name>
          ,
          <string-name>
            <given-names>E.</given-names>
            <surname>Tran-Johnson</surname>
          </string-name>
          , E. Perez,
          <string-name>
            <given-names>J.</given-names>
            <surname>Kerr</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Mueller</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Ladish</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Landau</surname>
          </string-name>
          ,
          <string-name>
            <given-names>K.</given-names>
            <surname>Ndousse</surname>
          </string-name>
          ,
          <string-name>
            <given-names>K.</given-names>
            <surname>Lukosuite</surname>
          </string-name>
          ,
          <string-name>
            <given-names>L.</given-names>
            <surname>Lovitt</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Sellitto</surname>
          </string-name>
          ,
          <string-name>
            <given-names>N.</given-names>
            <surname>Elhage</surname>
          </string-name>
          ,
          <string-name>
            <given-names>N.</given-names>
            <surname>Schiefer</surname>
          </string-name>
          ,
          <string-name>
            <given-names>N.</given-names>
            <surname>Mercado</surname>
          </string-name>
          , N. DasSarma, R. Lasenby,
          <string-name>
            <given-names>R.</given-names>
            <surname>Larson</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Ringer</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Johnston</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Kravec</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S. E.</given-names>
            <surname>Showk</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Fort</surname>
          </string-name>
          ,
          <string-name>
            <given-names>T.</given-names>
            <surname>Lanham</surname>
          </string-name>
          ,
          <string-name>
            <given-names>T.</given-names>
            <surname>Telleen-Lawton</surname>
          </string-name>
          ,
          <string-name>
            <given-names>T.</given-names>
            <surname>Conerly</surname>
          </string-name>
          ,
          <string-name>
            <given-names>T.</given-names>
            <surname>Henighan</surname>
          </string-name>
          ,
          <string-name>
            <given-names>T.</given-names>
            <surname>Hume</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S. R.</given-names>
            <surname>Bowman</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Z.</given-names>
            <surname>Hatfield-Dodds</surname>
          </string-name>
          ,
          <string-name>
            <given-names>B.</given-names>
            <surname>Mann</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D.</given-names>
            <surname>Amodei</surname>
          </string-name>
          ,
          <string-name>
            <given-names>N.</given-names>
            <surname>Joseph</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>McCandlish</surname>
          </string-name>
          ,
          <string-name>
            <given-names>T.</given-names>
            <surname>Brown</surname>
          </string-name>
          , J. Kaplan,
          <article-title>Constitutional ai: Harmlessness from ai feedback</article-title>
          ,
          <year>2022</year>
          . URL: https://arxiv.org/abs/2212.08073.
        </mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>