=Paper=
{{Paper
|id=Vol-2640/paper_15
|storemode=property
|title=Aligning with Heterogenous Preferences for Kidney Exchange
|pdfUrl=https://ceur-ws.org/Vol-2640/paper_15.pdf
|volume=Vol-2640
|authors=Rachel Freedman
|dblpUrl=https://dblp.org/rec/conf/ijcai/Freedman20
}}
==Aligning with Heterogenous Preferences for Kidney Exchange==
Aligning with Heterogeneous Preferences for Kidney Exchange Rachel Freedman University of California, Berkeley rachel.freedman@berkeley.edu Abstract conflicting moral preferences, and AI algorithms responsible for making decisions on behalf of these heterogeneous groups AI algorithms increasingly make decisions that im- must aggregate and arbitrate between these preferences. pact entire groups of humans. Since humans tend Many existing approaches to preference aggregation for AI to hold varying and even conflicting preferences, rely on determining a single representative objective or deci- AI algorithms responsible for making decisions sion for the AI to implement [Lee et al., 2018; Zhang and on behalf of such groups encounter the problem Conitzer, 2019; Conitzer et al., 2015]. However, humans of preference aggregation: combining inconsis- are known for their variable and contradictory preferences, tent and sometimes contradictory individual pref- meaning that many individuals will hold preferences that dif- erences into a representative aggregate. In this pa- fer greatly from the mean. Better techniques are required per, we address this problem in a real-world public to model such heterogeneous human preferences, implement health context: kidney exchange. The algorithms them in AI algorithms, and measure their satisfaction in prac- that allocate kidneys from living donors to patients tice. needing transplants in kidney exchange matching markets should prioritize patients in a way that One domain in which this is particularly apparent is that aligns with the values of the community they serve, of kidney exchange. In a kidney exchange, patients who but allocation preferences vary widely across in- need kidney transplants and have found willing but medi- dividuals. In this paper, we propose, implement cally incompatible donors are matched and exchange kidneys and evaluate a methodology for prioritizing pa- with other such incompatible patient-donor pairs [Roth et al., tients based on such heterogeneous moral prefer- 2004]. Many countries, including the United States [Dicker- ences. Instead of selecting a single static set of son and Sandholm, 2015], the United Kingdom [Manlove and patient weights, we learn a distribution over pref- O’Malley, 2015] and much of Europe [Biró et al., 2017] use erence functions based on human subject responses algorithms developed in the AI community to automate this to allocation dilemmas, then sample from this dis- matching. Since the prognosis for patients who do not receive tribution to dynamically determine patient weights kidney transplants is quite poor, these automated decisions during matching. We find that this methodology in- have life-or-death consequences and great moral import. It creases the average rank of matched patients in the is therefore vital that these allocation decisions are made in sampled preference ordering, indicating better sat- a way that aligns with societal values. Previous work has isfaction of group preferences. We hope that this sought to learn a single static utility function that kidney allo- work will suggest a roadmap for future automated cation algorithms can use to prioritize certain types of match- moral decision making on behalf of heterogeneous ing [Freedman et al., 2020]. However, that work disregards groups. the empirical heterogeneity in human ethical judgements in this domain. We seek to instead model this heterogeneity by developing a methodology that represents the full distribution 1 Introduction of human judgements. In this work, we draw on preference aggregation and social As AI algorithms become increasingly powerful and more welfare theory to design, implement and evaluate a methodol- widely deployed, it is vital that they act in a way that aligns ogy for autonomously allocating kidneys to patients in match- with human values. Unfortunately, in most real-world do- ing markets based on the heterogeneous moral preferences mains, people do not unanimously agree on a single set of expressed by surveyed human participants. We propose an “human values” that AI algorithms can model and instanti- alternate model for aggregating preferences drawn from the ate. Instead, groups of humans tend to hold varying and even economics literature and an alternate domain-specific mea- Copyright c 2020 for this paper by its authors. Use permitted sure of group welfare based on individual preference rank- under Creative Commons License Attribution 4.0 International (CC ings. We show that our proposed model, which aggregates in- BY 4.0). dividual preferences into a distribution over utility functions Profile Age Drinking Cancer 1 30 rare healthy 2 30 frequently healthy 3 30 rare cancer 4 30 frequently cancer Figure 1: Example compatibility graph. Donor and patient blood types are in parentheses and arrows indicate possible valid dona- 5 70 rare healthy tions. This graph has two valid donation cycles: cA = {v1 , v2 } and cB = {v2 , v3 }. However, both contain v2 , so only one set of 6 70 frequently healthy donations can take place. 7 70 rare cancer 8 70 frequently cancer instead of a single function, improves the average rank of the matched kidney donations in individuals’ preference or- Table 1: Patient profile descriptions enumerated and (arbitrarily) derings, without reducing the number of patients that can be numbered by Freedman et al. matched overall. Incorporating the model into this real-world AI system leads to more beneficial outcomes according to our proposed measure of social welfare. and difficult to approximate [Biro and Cechlarova, 2007], so We hope that this work will both highlight the preference this problem is typically solved by formulating it as a lin- aggregation challenges present in many allocative AI sys- ear program (LP) and solving it with an LP-solver such as tems, and serve as a roadmap for developing systems that CPLEX. directly address these challenges in other real-world contexts. We typically assign a weight we to each edge e to rep- resent the utility of that particular donation taking place. 2 Kidney Exchanges In the national US exchange, these weights are set ad-hoc by a committee [UNOS, 2015], but in this work we will 2.1 Graph Formulation adapt an alternative method that learns these weights based In a kidney exchange, patients who need a kidney transplant on human responses to allocation dilemmas [Freedman et and donors who are willing to donate to them but are med- al., 2020]. The clearing house problem is to find the op- ically incompatible can be matched with other such patient- timal matching M ∗ which maximizes some utility function donor pairs [Roth et al., 2004]. For example, if the donor of u : M → R. This is typically formalized as the graph- pair i is compatible with the patient of pair j, and the donor theoretic problemP of finding the maximum weighted cycle of pair j is likewise compatible with the patient of pair i, they P cover u(M ) = c∈M e∈c we . To solve this via linear pro- can form a matching, in which donor di donates to patient pj gramming, let C(L) Pbe the set of all cycles of length no more and donor dj donates to patient pi . than L, let wc = c∈C(L) we , create an activation variable In the standard formulation, a kidney exchange is described xc ∈ 0, 1 for each cycle c ∈ C(L), then solve the following by a compatibility graph G = hV, Ei [Roth et al., 2004; linear program: Roth et al., 2005]. We construct one vertex v for each patient- donor pair, then add a directed edge ei,j from vi to vj if di X X is compatible with pj . A cycle c is a possible sequence of max wc xc s.t. xc ≤ 1 ∀v ∈ V. (1) valid transplants, in which each donor in the cycle donates c∈C(L) c:v∈c a kidney and each patient receives one. A matching M is a set of disjoint cycles. An example compatibility graph is using an LP-solver such as CPLEX. The final matching M shown in Figure 1. Each oval in the figure represents a vertex, is the set of cycles c with an activation xc = 1. If all edge and each arrow represents a directed edge signifying donor- weights we are equal, then solving this LP gives a maximum- patient compatibility. There are two cycles in this particu- cardinality matching. In cases where there are multiple valid lar compatibility graph: cA = {v1 , v2 } and cB = {v2 , v3 }. matchings of equivalent cardinality, such as MA and MB in However, these two cycles are not disjoint, because they share Figure 1, this LP-solver must choose between them randomly. vertex v2 . The v2 donor cannot donate both of their kidneys, However, if the edge weights are set according to some util- so these exchanges cannot both take place. This compatibil- ity function, then the solution can prioritize certain types of ity graph therefore has two valid matchings, MA = {cA } and matches. This allows us to incorporate societal preferences. MB = {cB }, each with cardinality 2. 2.3 Incorporating Preferences 2.2 Clearing Algorithm Previous work has attempted to improve the matching priori- The clearing house problem in kidney exchange is to find tization in kidney exchanges based on sampled human ethical the optimal valid matching, according to some utility func- preferences [Freedman et al., 2020]. All else equal, it is ob- tion [Abraham et al., 2007]. Finding valid matchings with a viously morally preferable to save lives by matching as many finite limit on cycle lengths is NP-hard [Abraham et al., 2007] patient-donor pairs as possible. However, in cases such as the one in Figure 1, there can be multiple maximum-cardinality as profile 8), will therefore have the lowest scores. Freedman matchings. In this case, the algorithm requires a utility func- et al. use this model to estimate a single set of scores based tion that distinguishes between them, ideally in a way that on the pooled pairwise comparisons from every survey re- aligns with human values. The US national kidney exchange spondent. This allows them to aggregate all preferences into attempts to do this, but they prioritize matches in an opaque a single set of scores. and ad-hoc fashion via committee [UNOS, 2015]. This ex- They then revised the LP from Eq 1, setting the weight cludes most of the societal members who will actually par- of each edge ei,j to be the BT-score of the recipient pj and ticipate in the exchange from the discussion, and leaves the adding a cardinality constraint to require that the LP still committee with the still-unsolved problem of designing a util- only produce maximum-cardinality matchings. Let wBT (v) ity function that captures the ethical preferences of an entire be the BT-score of the patient in vertex v, and let Q be the society. Freedman et al. propose an alternative methodology maximum matching cardinality possible for the compatibility for learning domain-relevant ethical preferences from actual graph. Then the revised LP is: human decisions in kidney allocation dilemmas, revising the LP in Eq 1 to take these into account, and then evaluating the P hP i impact on a simulated exchange. Our work proposes an im- max wBT (v) xc Pc∈C(L) (u,v)∈c (3) provement on Freedman et al.’s methodology for aggregating s.t. Pc:v∈c xc ≤ 1 ∀v ∈ V preferences and evaluating results, so we will briefly outline c∈C(L) |c|xc ≥ Q their full methodology here. Freedman et al. conducted two surveys on participants They evaluated this revised algorithm on a simulated kid- from Amazon’s Mechanical Turk platform (“MTurk”). The ney exchange and found that it matched significantly more first survey asked participants (N = 100) to read a brief of the higher-scoring profiles and significantly fewer of the description of the kidney transplant waiting list process, and lower-scoring ones. However, this methodology relies on the then asked them to propose which patient characteristics they assumption that societal preferences are sufficiently homoge- thought “morally ought” to be used to prioritize patients. neous to be captured by a single static utility function. An The three most popular categories of responses were “age”, algorithm using this methodology will always choose to save “health – behavioral” (including aspects of health believed a patient of profile 1 over a patient of profile 2. However, to be under personal control, such as diet and drinking), and the preferences expressed in the survey data actually varied “health – general” (including aspects of health unrelated to greatly, and participants did sometimes prefer patients of pro- kidney disease). The second survey asked a new set of par- file 2 to profile 1. Presumably the preferences of a representa- ticipants (N = 289) to decide how to allocate kidneys be- tive sample of the actual US population would be even more tween pairs of fictional patient “profiles” that vary according heterogeneous. In this sense, both the static profile scoring to these attributes. In order to make the profiles more con- and the assessment of the algorithm by the proportion of each crete, drinking behavior (“1 alcoholic drink per month” or “5 profile matched are flawed. In this work, we improve upon alcoholic drinks per day”) was used as a proxy for behav- both of these elements by removing the requirement for a sin- ioral health, and cancer (“skin cancer in remission” or “no gle utility function and developing an alternate methodology other major health problems”) was used as a proxy for general for modifying and evaluating the algorithm. health. For example, a sample question asked participants to choose between “Patient W.A. [who] is 30 years old, had 1 al- 3 Incorporating Heterogeneous Preferences coholic drink per month (prior to diagnosis), and has no other In our work we improve upon the methodology presented in major health problems” and “Patient R.F. [who] is 70 years Section 2.3 by removing the unrealistic assumption that soci- old, had 5 alcoholic drinks per day (prior to diagnosis), and etal preferences can be captured by a single utility function. has skin cancer in remission”. They defined 8 such patient We propose 1) an alternative preference aggregation method profiles, with characteristics described in Table 1, and asked that better captures the variation in expressed preferences, 2) each participant to compare each pair of profiles. We will use modifications to the kidney allocation algorithm to take this these profile descriptions and this preference data in our own new preference aggregation into account, and 3) an alterna- work. tive evaluation metric for the resulting matchings that lends Freedman et al. used the Bradley-Terry Model (“BT more consideration to individual welfare. Model”) to estimate a “BT-score” for each patient profile. The BT model assumes that each profile x has an underly- 3.1 Preference Aggregation Model: BLP ing score px that represents the value that survey participants Instead of learning a single score for each profile as in collectively place on donating to a patient with that profile. previous work [Freedman et al., 2020], we use the Berry- Under this model, the probability that profile i will be pre- Levinsohn-Pakes Model (“BLP Model”) to estimate a distri- ferred to profile j is: bution over possible utility functions. We propose that learn- pi ing and sampling from this distribution better satisfies in- P (i > j) = (2) dividual preferences than learning a single utility function. pi + pj The BLP model is an extension of the logit discrete choice Patient profiles that are almost always selected by our survey model that is widely used in estimating consumer discrete- participants (such as profile 1 in Table 1) will therefore have choice demand for differentiated products [Berry et al., 1995; the highest scores, and profiles that are rarely selected (such Nevo, 2001]. When we apply this model to kidney exchange, the “consumers” are members of the population that the ex- functions as a proxy for individual welfare because it repre- change serves, and the “products” are the patients who may sents the extent to which an individual’s domain-relevant val- potentially be matched with donors. Using this model allows ues were fulfilled. We claim that the average rank of matched us to predict how the general population wants the exchange donations is a better measure of the extent to which an algo- to prioritize patients. rithm values individual welfare than the proportion of each For a graph G = hV, Ei, we wish to define a utility func- profile matched because the ranks of all matches depend on tion U : V → R that determines the utility of the pa- the full BLP distribution. In contrast, the proportion matched tient in each vertex receiving a utility function. The BLP measure relies on the false assumption that everyone’s pref- > model defines the utility function U(v) = Xp(v) β + where erences are better satisfied if patients with higher BT-scores Xp(v) = {1(vage = 30), 1(vdrinking = rare), 1(vcancer = are matched more often. healthy)} are the binary features of the patient profile of ver- We run both algorithms on a simulated kidney exchange, tex v, β ∼ N (µ, Σ) gives the weight of each feature, and is along with a third algorithm that weights all donations equally an error term following a type-II extreme value distribution. as a baseline. We evaluate the resulting matchings both on the We use maximum likelihood estimation to fit the distribu- proportion of each profile matched, and on our proposed aver- tion parameters (µ, Σ) to the pairwise comparison survey data age rank measure. We find that our proposed algorithm con- gathered by Freedman et al. Let P be the set of all patient sistently outperforms both others on the rank measure, sug- profiles described in Table 1, and for each pair of profiles gesting that it better represents the full distribution of societal i, j ∈ P , let ck (i, j) be survey respondent k’s preferred pro- preferences. file. This allows us to define the likelihood function 4 Experiments 4.1 Conditions Y exp(Xc>k (i,j) β) Lk (µ, Σ | ck ) = Eβ∼N (µ,Σ) We tested three versions of the matching algorithm: the base- i,j exp(Xi> β) + exp(Xj> β) line one that weights all donations equally, the one with a (4) single utility function described in Section 2.3, and one with and to estimate the maximum likelihood distribution parame- a distribution over utility functions proposed in Section 3.1. ters Condition 1: EQUAL The EQUAL algorithm matches kid- N ney exchange participants using the LP in Eq 1. That is, 1 X it weights all participants equally and chooses randomly µ̂, Σ̂ = argmax log(Lk (µ, Σ | ck )) (5) amongst the highest-cardinality matchings. We use this con- µ,Σ N k=1 dition as a baseline because it describes the case in which 3.2 Algorithm ethical preferences are not incorporated into the algorithm at all. Each time a new patient-donor pair enters the exchange, we add a corresponding vertex u to the graph, randomly sample Condition 2: HOMOGENEOUS The HOMOGENEOUS algo- a βu ∼ N (µ̂, Σ̂) from the learned distribution, and weight rithm matches participants using the LP in Eq 3. It assigns outgoing edges u → v using the resulting “BLP function”: edge weights based on the BT-score of the recipient, relying BLP (u, v) = Xp(v) > βu + uv . In this way, we represent the on the assumption that individual preferences are sufficiently homogeneous to be captured by a single static utility function. full distribution of preferences. Note that the BLP function This is the algorithm proposed by Freedman et al.. See Ta- indicates a random sample from the surveyed population’s ble 2 for the weights used in the EQUAL and HOMOGENEOUS preference distribution – it does not represent the preferences conditions. of donor u specifically. Letting wBLP (u,v) be the score that vertex u’s sampled BLP function places on donating to the Condition 3: HETEROGENEOUS The HETEROGENEOUS patient in vertex v, we modify the LP in Eq 3 to be: algorithm matches participants using the LP in Eq 6. It sam- ples a BLP function when each vertex is added to the graph, P hP i normalizes the scores produced by that function to the range max c∈C(L) (u,v)∈c wBLP (u,v) xc [0, 1], and uses that function and the profile of the recipient to weight each new outgoing edge. This allows for the pos- P s.t. Pc:v∈c xc ≤ 1 ∀v ∈ V c∈C(L) |c|xc ≥ Q sibility that heterogeneous individual preferences are better (6) captured by a distribution than by a single utility function. This is the novel algorithm that we propose in this work. 3.3 Evaluation Metric 4.2 Measures We further define the rank of a donation u → v to be the po- sition of v’s patient profile in the preference ordering induced We evaluate each algorithm according to both the measure we by u’s BLP function. For example, if the BLP function asso- propose in Section 3 and the measure used by Freedman et al. ciated with vertex u weights the profile of the patient in vertex Average Rank The average rank of a matching is the aver- v above all other profiles, rank(u, v) = 1. Conversely, if the age rank of each donation in the matching, where rank(u, v) BLP function weights the profile below the other seven pos- of a donation u → v is as defined in Section 3.3. Recall sible patient profiles, rank(u, v) = 8. In this context, rank that lower ranks indicate higher preference satisfaction and, Profile ID EQUAL HOMOGENEOUS 1 1.000 1.000 2 1.000 0.103 3 1.000 0.236 4 1.000 0.036 5 1.000 0.070 6 1.000 0.012 7 1.000 0.024 8 1.000 0.003 Table 2: Patient profile weights for the EQUAL and HOMOGENEOUS experimental conditions. The EQUAL algorithm values all profiles equally, so all have weight 1.0. However, the HOMOGENEOUS algorithm weights profiles according to their BT-scores. The HETEROGENEOUS algorithm samples BLP functions throughout Figure 2: Average rank of donations in each simulated run matching and so does not have a static weight for each profile. (N = 50). Lower ranks indicate more-preferred matches. The HETEROGENEOUS algorithm produces the lowest average ranks (median = 3.24), followed by the HOMOGENEOUS algorithm (me- since there are 8 profiles, all possible ranks fall in the range dian = 3.66), then the EQUAL algorithm (median = 4.06). [1.0, 8.0]. For each run of each algorithm, we recorded the av- erage rank of every matching in the simulation, then averaged these to get the average rank value for that algorithm. HOMOGENEOUS algorithm produces the next-lowest aver- age rank, followed by the EQUAL algorithm, which produces Proportion Matched The proportion matched of a profile the highest average rank. This is because the EQUAL algo- is the proportion of patients of that type that entered the kid- rithm weights all edges equally, matching recipients without ney exchange pool and were subsequently matched. A pro- any consideration of the personal characteristics used to de- portion matched of 100% means that all patients of that type fine their weight. The HOMOGENEOUS algorithm improves were matched, and a proportion matched of 0% means that upon this by considering the characteristics of donation re- none were. For each run of each algorithm, we recorded the cipients, but fails to approximate preferences as closely as number of patients with each profile that entered the pool and HETEROGENEOUS. the number of patients of each profile that were eventually matched, and used this to calculate proportion matched. Proportion Matched Freedman et al. quantified the im- pact of their modified algorithm by comparing the propor- 4.3 Experimental Setup tions of patients of each profile type matched by their al- We built a simulator1 to mimic daily matching using the gorithm against the proportions matched by the unmodified EQUAL, HOMOGENEOUS, or HETEROGENEOUS algorithms algorithm, so for the sake of comparison we do the same. based on previously developed tools [Dickerson and Sand- The proportions of each profile matched by the EQUAL al- holm, 2015; Freedman et al., 2020]. Each simulated “day”, gorithm, the HOMOGENEOUS algorithm (proposed by Freed- some pairs enter the pool, some pairs exit the pool, and then man et al.) and the HETEROGENEOUS algorithm (proposed the matching algorithm is run on the pairs that remain. The in this work) are shown in Figure 3. unmatched pairs remain in the pool to potentially be matched Since it doesn’t take patient profiles into account, the in the future. For each algorithm, we executed 50 runs of 5 EQUAL algorithm matched approximately the same percent- years of simulated daily matching, and recorded the average age of patients across all profiles. Since it prioritizes pa- matching rank and profile proportions matched for each run. tients solely based on profile, the HOMOGENEOUS algorithm matched the more popular profiles (1-3) more often and the less popular profiles (4-8) less. Notably, the HOMOGENEOUS 5 Results algorithm almost always matches patients of profile 1, indi- Average Rank Since lower donation ranks indicate that cating that a patient’s profile can be one of the major factors the recipient is higher in the sampled preference ordering, in determining whether they receive a kidney. However, the we propose that algorithms that induce lower average ranks HETEROGENEOUS algorithm prioritizes patients not directly better satisfy population preferences. As expected, the pro- based on their profile, but based on the sampled BLP func- posed HETEROGENEOUS algorithm consistently produces tion’s valuation of their profile. As a result, this algorithm matchings with the lowest average rank (Figure 2). The still tends to match more of the commonly-preferred profiles and fewer of the commonly-dispreferred ones, but sometimes 1 samples a BLP function from the tails of the distribution that All code for this paper can be found in the Variation pack- age of github.com/RachelFreedman/KidneyExchange prioritizes patient profiles very differently. of weighting all potential kidney recipients equally, deciding how to prioritize them in an opaque and ad-hoc way [UNOS, 2015], or prioritizing them based on a single static utility function [Freedman et al., 2020], we proposed learning a dis- tribution over surveyed preferences and then sampling from this distribution for dynamic weighting during matching. We furthermore proposed donation rank as a better measure of preference satisfaction. We implemented our proposed algo- rithm and compare it to predecessor algorithms on a kidney exchange simulation, finding that our algorithm better satis- fies survey participant preferences. Our model was estimated based on preference data elicited from MTurk survey participants, who are assuredly not rep- resentative of society in general. Future work should elicit preferences from a more representative sample, and perhaps privilege preferences expressed by domain experts and stake- Figure 3: Proportion of patients of each profile matched in each sim- holders such as doctors, policy-makers and kidney exchange ulated run (N = 50). All algorithms match approximately 62% participants. However, we believe that our sample was not of patients overall, but the HOMOGENEOUS algorithm dispropor- more heterogeneous than the US population as a whole, so tionately matches profiles with higher BT-scores, the EQUAL algo- we expect the challenge of heterogeneity and our method- rithm matches all profiles approximately equally, and the propor- ology to continue to be relevant for this expanded sample. tions matched by the HETEROGENEOUS algorithm lie between the Moreover, since even a representative sample of the general other two. public would still lack relevant domain-specific knowledge about kidney transplants, future work should also investigate methodologies that allow domain experts to correct or mod- If the survey preferences had been completely homoge- erate the outcomes of this process. neous, then the HETEROGENEOUS algorithm would have We hope that the challenges highlighted and methodology produced the same results as the HOMOGENEOUS algo- prototyped in this work will suggest a roadmap for develop- rithm. However, because preferences expressed in the sur- ing techniques for automating moral decision making in other vey data sometimes differ from the utility function used for domains. the HOMOGENEOUS algorithm, sometimes different matches are made. For example, while most survey participants pre- ferred patient profile 1 to all other profiles, some did not, so Acknowledgements the HETEROGENEOUS algorithm respects this heterogeneity We thank Yunhao Huang for implementing the BLP model, by sometimes prioritizing matching other profiles over profile the Duke Moral AI team for sharing the human subject data, 1. This difference in matching is a further indication that our Dr. John Dickerson for building an earlier version of the kid- proposed algorithm more faithfully represents the full distri- ney exchange simulation, and Dr. Peter Eckersley for early bution over preferences. discussions of the idea. 6 Discussion References Faithfully instantiating the collective values of groups with [Abraham et al., 2007] David J Abraham, Avrim Blum, and heterogeneous individual preferences is a frequent challenge Tuomas Sandholm. Clearing algorithms for barter ex- for real-world AI systems. For example, we commonly change markets: Enabling nationwide kidney exchanges. use AI systems to allocate scarce resources – such as kid- In Proceedings of the 8th ACM conference on Electronic ney donors, food donations [Lee et al., 2018] and interview commerce, pages 295–304, 2007. slots [Schumann et al., 2019] – amongst group members in [Berry et al., 1995] Steven Berry, James Levinsohn, and a way that we hope maximizes group welfare. Moreover, Ariel Pakes. Automobile prices in market equilibrium. our roads may soon be populated with autonomous vehicles, Econometrica: Journal of the Econometric Society, pages which will have to make moral tradeoffs – such determin- 841–890, 1995. ing who to sacrifice in unavoidable collisions [Bonnefon et al., 2016] – based on the complex and often contradictory [Biro and Cechlarova, 2007] Peter Biro and Katarina Cech- moral frameworks of the communities in which they operate. larova. Inapproximability of the kidney exchange problem. It is therefore vital that the AI community develop techniques Information Processing Letters, 101(5):199–202, 2007. for faithfully aggregating such heterogeneous preferences and [Biró et al., 2017] Péter Biró, Lisa Burnapp, Bernadette use them to develop socially beneficial AI systems. Haase, Aline Hemke, Rachel Johnson, Joris van de Klun- In this paper, we proposed, instantiated, and evaluated one dert, and David Manlove. Kidney exchange practices such technique for incorporating heterogeneous ethical pref- in Europe, 2017. First Handbook of the COST Action erences into a specific real-world AI system: an algorithm CA15210: European Network for Collaboration on Kid- for matching patient-donor pairs in kidney exchange. Instead ney Exchange Programmes. [Bonnefon et al., 2016] Jean-François Bonnefon, Azim Shariff, and Iyad Rahwan. The social dilemma of autonomous vehicles. Science, 352(6293):1573–1576, 2016. [Conitzer et al., 2015] Vince Conitzer, Markus Brill, and Rupert Freeman. Crowdsourcing societal tradeoffs. In Proceedings of the 2015 international conference on au- tonomous agents and multiagent systems, pages 1213– 1217, 2015. [Dickerson and Sandholm, 2015] John P Dickerson and Tuomas Sandholm. Futurematch: Combining human value judgments and machine learning to match in dy- namic environments. In Twenty-Ninth AAAI Conference on Artificial Intelligence, 2015. [Freedman et al., 2020] Rachel Freedman, Jana Schaich Borg, Walter Sinnott-Armstrong, John P Dickerson, and Vincent Conitzer. Adapting a kidney exchange algorithm to align with human values. Artificial Intelligence, page 103261, 2020. [Lee et al., 2018] Min Kyung Lee, Daniel Kusbit, Anson Kahng, Ji Tae Kim, Xinran Yuan, Allissa Chan, Ritesh Noothigattu, Daniel See, Siheon Lee, Christos-Alexandros Psomas, et al. Webuildai: Participatory framework for fair and efficient algorithmic governance. Preprint, 2018. [Manlove and O’Malley, 2015] David Manlove and Gregg O’Malley. Paired and altruistic kidney donation in the UK: Algorithms and experimentation. ACM Journal of Experi- mental Algorithmics, 19(1), 2015. [Nevo, 2001] Aviv Nevo. Measuring market power in the ready-to-eat cereal industry. Econometrica, 69(2):307– 342, 2001. [Roth et al., 2004] Alvin E Roth, Tayfun Sönmez, and M Utku Ünver. Kidney exchange. The Quarterly journal of economics, 119(2):457–488, 2004. [Roth et al., 2005] Alvin E Roth, Tayfun Sönmez, et al. A kidney exchange clearinghouse in new england. American Economic Review, 95(2):376–380, 2005. [Schumann et al., 2019] Candice Schumann, Zhi Lang, Jef- frey Foster, and John Dickerson. Making the cut: A bandit-based approach to tiered interviewing. In Advances in Neural Information Processing Systems, pages 4641– 4651, 2019. [UNOS, 2015] UNOS. Revising kidney paired donation pi- lot program priority points, 2015. OPTN/UNOS Public Comment Proposal. [Zhang and Conitzer, 2019] Hanrui Zhang and Vincent Conitzer. A pac framework for aggregating agents’ judgments. In Proceedings of the AAAI Conference on Artificial Intelligence, volume 33, pages 2237–2244, 2019.