=Paper= {{Paper |id=Vol-2640/paper_15 |storemode=property |title=Aligning with Heterogenous Preferences for Kidney Exchange |pdfUrl=https://ceur-ws.org/Vol-2640/paper_15.pdf |volume=Vol-2640 |authors=Rachel Freedman |dblpUrl=https://dblp.org/rec/conf/ijcai/Freedman20 }} ==Aligning with Heterogenous Preferences for Kidney Exchange== https://ceur-ws.org/Vol-2640/paper_15.pdf
               Aligning with Heterogeneous Preferences for Kidney Exchange

                                                    Rachel Freedman
                                             University of California, Berkeley
                                              rachel.freedman@berkeley.edu



                          Abstract                                  conflicting moral preferences, and AI algorithms responsible
                                                                    for making decisions on behalf of these heterogeneous groups
     AI algorithms increasingly make decisions that im-             must aggregate and arbitrate between these preferences.
     pact entire groups of humans. Since humans tend                   Many existing approaches to preference aggregation for AI
     to hold varying and even conflicting preferences,              rely on determining a single representative objective or deci-
     AI algorithms responsible for making decisions                 sion for the AI to implement [Lee et al., 2018; Zhang and
     on behalf of such groups encounter the problem                 Conitzer, 2019; Conitzer et al., 2015]. However, humans
     of preference aggregation: combining inconsis-                 are known for their variable and contradictory preferences,
     tent and sometimes contradictory individual pref-              meaning that many individuals will hold preferences that dif-
     erences into a representative aggregate. In this pa-           fer greatly from the mean. Better techniques are required
     per, we address this problem in a real-world public            to model such heterogeneous human preferences, implement
     health context: kidney exchange. The algorithms                them in AI algorithms, and measure their satisfaction in prac-
     that allocate kidneys from living donors to patients           tice.
     needing transplants in kidney exchange matching
     markets should prioritize patients in a way that                  One domain in which this is particularly apparent is that
     aligns with the values of the community they serve,            of kidney exchange. In a kidney exchange, patients who
     but allocation preferences vary widely across in-              need kidney transplants and have found willing but medi-
     dividuals. In this paper, we propose, implement                cally incompatible donors are matched and exchange kidneys
     and evaluate a methodology for prioritizing pa-                with other such incompatible patient-donor pairs [Roth et al.,
     tients based on such heterogeneous moral prefer-               2004]. Many countries, including the United States [Dicker-
     ences. Instead of selecting a single static set of             son and Sandholm, 2015], the United Kingdom [Manlove and
     patient weights, we learn a distribution over pref-            O’Malley, 2015] and much of Europe [Biró et al., 2017] use
     erence functions based on human subject responses              algorithms developed in the AI community to automate this
     to allocation dilemmas, then sample from this dis-             matching. Since the prognosis for patients who do not receive
     tribution to dynamically determine patient weights             kidney transplants is quite poor, these automated decisions
     during matching. We find that this methodology in-             have life-or-death consequences and great moral import. It
     creases the average rank of matched patients in the            is therefore vital that these allocation decisions are made in
     sampled preference ordering, indicating better sat-            a way that aligns with societal values. Previous work has
     isfaction of group preferences. We hope that this              sought to learn a single static utility function that kidney allo-
     work will suggest a roadmap for future automated               cation algorithms can use to prioritize certain types of match-
     moral decision making on behalf of heterogeneous               ing [Freedman et al., 2020]. However, that work disregards
     groups.                                                        the empirical heterogeneity in human ethical judgements in
                                                                    this domain. We seek to instead model this heterogeneity by
                                                                    developing a methodology that represents the full distribution
1   Introduction                                                    of human judgements.
                                                                       In this work, we draw on preference aggregation and social
As AI algorithms become increasingly powerful and more
                                                                    welfare theory to design, implement and evaluate a methodol-
widely deployed, it is vital that they act in a way that aligns
                                                                    ogy for autonomously allocating kidneys to patients in match-
with human values. Unfortunately, in most real-world do-
                                                                    ing markets based on the heterogeneous moral preferences
mains, people do not unanimously agree on a single set of
                                                                    expressed by surveyed human participants. We propose an
“human values” that AI algorithms can model and instanti-
                                                                    alternate model for aggregating preferences drawn from the
ate. Instead, groups of humans tend to hold varying and even
                                                                    economics literature and an alternate domain-specific mea-
    Copyright c 2020 for this paper by its authors. Use permitted   sure of group welfare based on individual preference rank-
under Creative Commons License Attribution 4.0 International (CC    ings. We show that our proposed model, which aggregates in-
BY 4.0).                                                            dividual preferences into a distribution over utility functions
                                                                                 Profile     Age      Drinking       Cancer
                                                                                     1       30          rare        healthy
                                                                                     2       30       frequently     healthy
                                                                                     3       30          rare        cancer
                                                                                     4       30       frequently     cancer
Figure 1: Example compatibility graph. Donor and patient blood
types are in parentheses and arrows indicate possible valid dona-                    5       70          rare        healthy
tions. This graph has two valid donation cycles: cA = {v1 , v2 }
and cB = {v2 , v3 }. However, both contain v2 , so only one set of                   6       70       frequently     healthy
donations can take place.                                                            7       70          rare        cancer
                                                                                     8       70       frequently     cancer
instead of a single function, improves the average rank of
the matched kidney donations in individuals’ preference or-          Table 1: Patient profile descriptions enumerated and (arbitrarily)
derings, without reducing the number of patients that can be         numbered by Freedman et al.
matched overall. Incorporating the model into this real-world
AI system leads to more beneficial outcomes according to our
proposed measure of social welfare.                                  and difficult to approximate [Biro and Cechlarova, 2007], so
   We hope that this work will both highlight the preference         this problem is typically solved by formulating it as a lin-
aggregation challenges present in many allocative AI sys-            ear program (LP) and solving it with an LP-solver such as
tems, and serve as a roadmap for developing systems that             CPLEX.
directly address these challenges in other real-world contexts.         We typically assign a weight we to each edge e to rep-
                                                                     resent the utility of that particular donation taking place.
2     Kidney Exchanges                                               In the national US exchange, these weights are set ad-hoc
                                                                     by a committee [UNOS, 2015], but in this work we will
2.1    Graph Formulation                                             adapt an alternative method that learns these weights based
In a kidney exchange, patients who need a kidney transplant          on human responses to allocation dilemmas [Freedman et
and donors who are willing to donate to them but are med-            al., 2020]. The clearing house problem is to find the op-
ically incompatible can be matched with other such patient-          timal matching M ∗ which maximizes some utility function
donor pairs [Roth et al., 2004]. For example, if the donor of        u : M → R. This is typically formalized as the graph-
pair i is compatible with the patient of pair j, and the donor       theoretic problemP of finding the maximum weighted cycle
of pair j is likewise compatible with the patient of pair i, they
                                                                                             P
                                                                     cover u(M ) = c∈M e∈c we . To solve this via linear pro-
can form a matching, in which donor di donates to patient pj         gramming, let C(L) Pbe the set of all cycles of length no more
and donor dj donates to patient pi .                                 than L, let wc = c∈C(L) we , create an activation variable
   In the standard formulation, a kidney exchange is described       xc ∈ 0, 1 for each cycle c ∈ C(L), then solve the following
by a compatibility graph G = hV, Ei [Roth et al., 2004;              linear program:
Roth et al., 2005]. We construct one vertex v for each patient-
donor pair, then add a directed edge ei,j from vi to vj if di                X                             X
is compatible with pj . A cycle c is a possible sequence of           max            wc xc     s.t.                xc ≤ 1   ∀v ∈ V. (1)
valid transplants, in which each donor in the cycle donates                 c∈C(L)                         c:v∈c
a kidney and each patient receives one. A matching M is
a set of disjoint cycles. An example compatibility graph is          using an LP-solver such as CPLEX. The final matching M
shown in Figure 1. Each oval in the figure represents a vertex,      is the set of cycles c with an activation xc = 1. If all edge
and each arrow represents a directed edge signifying donor-          weights we are equal, then solving this LP gives a maximum-
patient compatibility. There are two cycles in this particu-         cardinality matching. In cases where there are multiple valid
lar compatibility graph: cA = {v1 , v2 } and cB = {v2 , v3 }.        matchings of equivalent cardinality, such as MA and MB in
However, these two cycles are not disjoint, because they share       Figure 1, this LP-solver must choose between them randomly.
vertex v2 . The v2 donor cannot donate both of their kidneys,        However, if the edge weights are set according to some util-
so these exchanges cannot both take place. This compatibil-          ity function, then the solution can prioritize certain types of
ity graph therefore has two valid matchings, MA = {cA } and          matches. This allows us to incorporate societal preferences.
MB = {cB }, each with cardinality 2.
                                                                     2.3   Incorporating Preferences
2.2    Clearing Algorithm                                            Previous work has attempted to improve the matching priori-
The clearing house problem in kidney exchange is to find             tization in kidney exchanges based on sampled human ethical
the optimal valid matching, according to some utility func-          preferences [Freedman et al., 2020]. All else equal, it is ob-
tion [Abraham et al., 2007]. Finding valid matchings with a          viously morally preferable to save lives by matching as many
finite limit on cycle lengths is NP-hard [Abraham et al., 2007]      patient-donor pairs as possible. However, in cases such as the
one in Figure 1, there can be multiple maximum-cardinality        as profile 8), will therefore have the lowest scores. Freedman
matchings. In this case, the algorithm requires a utility func-   et al. use this model to estimate a single set of scores based
tion that distinguishes between them, ideally in a way that       on the pooled pairwise comparisons from every survey re-
aligns with human values. The US national kidney exchange         spondent. This allows them to aggregate all preferences into
attempts to do this, but they prioritize matches in an opaque     a single set of scores.
and ad-hoc fashion via committee [UNOS, 2015]. This ex-              They then revised the LP from Eq 1, setting the weight
cludes most of the societal members who will actually par-        of each edge ei,j to be the BT-score of the recipient pj and
ticipate in the exchange from the discussion, and leaves the      adding a cardinality constraint to require that the LP still
committee with the still-unsolved problem of designing a util-    only produce maximum-cardinality matchings. Let wBT (v)
ity function that captures the ethical preferences of an entire   be the BT-score of the patient in vertex v, and let Q be the
society. Freedman et al. propose an alternative methodology       maximum matching cardinality possible for the compatibility
for learning domain-relevant ethical preferences from actual      graph. Then the revised LP is:
human decisions in kidney allocation dilemmas, revising the
LP in Eq 1 to take these into account, and then evaluating the                 P        hP                 i
impact on a simulated exchange. Our work proposes an im-                max                        wBT (v)   xc
                                                                               Pc∈C(L)     (u,v)∈c
                                                                                                                              (3)
provement on Freedman et al.’s methodology for aggregating              s.t.   Pc:v∈c xc ≤ 1                      ∀v ∈ V
preferences and evaluating results, so we will briefly outline                  c∈C(L) |c|xc ≥ Q
their full methodology here.
   Freedman et al. conducted two surveys on participants             They evaluated this revised algorithm on a simulated kid-
from Amazon’s Mechanical Turk platform (“MTurk”). The             ney exchange and found that it matched significantly more
first survey asked participants (N = 100) to read a brief         of the higher-scoring profiles and significantly fewer of the
description of the kidney transplant waiting list process, and    lower-scoring ones. However, this methodology relies on the
then asked them to propose which patient characteristics they     assumption that societal preferences are sufficiently homoge-
thought “morally ought” to be used to prioritize patients.        neous to be captured by a single static utility function. An
The three most popular categories of responses were “age”,        algorithm using this methodology will always choose to save
“health – behavioral” (including aspects of health believed       a patient of profile 1 over a patient of profile 2. However,
to be under personal control, such as diet and drinking), and     the preferences expressed in the survey data actually varied
“health – general” (including aspects of health unrelated to      greatly, and participants did sometimes prefer patients of pro-
kidney disease). The second survey asked a new set of par-        file 2 to profile 1. Presumably the preferences of a representa-
ticipants (N = 289) to decide how to allocate kidneys be-         tive sample of the actual US population would be even more
tween pairs of fictional patient “profiles” that vary according   heterogeneous. In this sense, both the static profile scoring
to these attributes. In order to make the profiles more con-      and the assessment of the algorithm by the proportion of each
crete, drinking behavior (“1 alcoholic drink per month” or “5     profile matched are flawed. In this work, we improve upon
alcoholic drinks per day”) was used as a proxy for behav-         both of these elements by removing the requirement for a sin-
ioral health, and cancer (“skin cancer in remission” or “no       gle utility function and developing an alternate methodology
other major health problems”) was used as a proxy for general     for modifying and evaluating the algorithm.
health. For example, a sample question asked participants to
choose between “Patient W.A. [who] is 30 years old, had 1 al-     3     Incorporating Heterogeneous Preferences
coholic drink per month (prior to diagnosis), and has no other    In our work we improve upon the methodology presented in
major health problems” and “Patient R.F. [who] is 70 years        Section 2.3 by removing the unrealistic assumption that soci-
old, had 5 alcoholic drinks per day (prior to diagnosis), and     etal preferences can be captured by a single utility function.
has skin cancer in remission”. They defined 8 such patient        We propose 1) an alternative preference aggregation method
profiles, with characteristics described in Table 1, and asked    that better captures the variation in expressed preferences, 2)
each participant to compare each pair of profiles. We will use    modifications to the kidney allocation algorithm to take this
these profile descriptions and this preference data in our own    new preference aggregation into account, and 3) an alterna-
work.                                                             tive evaluation metric for the resulting matchings that lends
   Freedman et al. used the Bradley-Terry Model (“BT              more consideration to individual welfare.
Model”) to estimate a “BT-score” for each patient profile.
The BT model assumes that each profile x has an underly-          3.1     Preference Aggregation Model: BLP
ing score px that represents the value that survey participants   Instead of learning a single score for each profile as in
collectively place on donating to a patient with that profile.    previous work [Freedman et al., 2020], we use the Berry-
Under this model, the probability that profile i will be pre-     Levinsohn-Pakes Model (“BLP Model”) to estimate a distri-
ferred to profile j is:                                           bution over possible utility functions. We propose that learn-
                                     pi                           ing and sampling from this distribution better satisfies in-
                    P (i > j) =                            (2)    dividual preferences than learning a single utility function.
                                  pi + pj                         The BLP model is an extension of the logit discrete choice
Patient profiles that are almost always selected by our survey    model that is widely used in estimating consumer discrete-
participants (such as profile 1 in Table 1) will therefore have   choice demand for differentiated products [Berry et al., 1995;
the highest scores, and profiles that are rarely selected (such   Nevo, 2001]. When we apply this model to kidney exchange,
the “consumers” are members of the population that the ex-             functions as a proxy for individual welfare because it repre-
change serves, and the “products” are the patients who may             sents the extent to which an individual’s domain-relevant val-
potentially be matched with donors. Using this model allows            ues were fulfilled. We claim that the average rank of matched
us to predict how the general population wants the exchange            donations is a better measure of the extent to which an algo-
to prioritize patients.                                                rithm values individual welfare than the proportion of each
    For a graph G = hV, Ei, we wish to define a utility func-          profile matched because the ranks of all matches depend on
tion U : V → R that determines the utility of the pa-                  the full BLP distribution. In contrast, the proportion matched
tient in each vertex receiving a utility function. The BLP             measure relies on the false assumption that everyone’s pref-
                                                >
model defines the utility function U(v) = Xp(v)    β +  where         erences are better satisfied if patients with higher BT-scores
Xp(v) = {1(vage = 30), 1(vdrinking = rare), 1(vcancer =                are matched more often.
healthy)} are the binary features of the patient profile of ver-          We run both algorithms on a simulated kidney exchange,
tex v, β ∼ N (µ, Σ) gives the weight of each feature, and  is         along with a third algorithm that weights all donations equally
an error term following a type-II extreme value distribution.          as a baseline. We evaluate the resulting matchings both on the
    We use maximum likelihood estimation to fit the distribu-          proportion of each profile matched, and on our proposed aver-
tion parameters (µ, Σ) to the pairwise comparison survey data          age rank measure. We find that our proposed algorithm con-
gathered by Freedman et al. Let P be the set of all patient            sistently outperforms both others on the rank measure, sug-
profiles described in Table 1, and for each pair of profiles           gesting that it better represents the full distribution of societal
i, j ∈ P , let ck (i, j) be survey respondent k’s preferred pro-       preferences.
file. This allows us to define the likelihood function
                                                                       4     Experiments
                                                                       4.1    Conditions
                                                                  
                              Y              exp(Xc>k (i,j) β)
Lk (µ, Σ | ck ) = Eβ∼N (µ,Σ)                                         We tested three versions of the matching algorithm: the base-
                                   i,j
                                         exp(Xi> β) + exp(Xj> β)       line one that weights all donations equally, the one with a
                                                         (4)           single utility function described in Section 2.3, and one with
and to estimate the maximum likelihood distribution parame-            a distribution over utility functions proposed in Section 3.1.
ters                                                                   Condition 1: EQUAL The EQUAL algorithm matches kid-
                                  N
                                                                       ney exchange participants using the LP in Eq 1. That is,
                               1 X                                     it weights all participants equally and chooses randomly
             µ̂, Σ̂ = argmax       log(Lk (µ, Σ | ck ))          (5)   amongst the highest-cardinality matchings. We use this con-
                      µ,Σ      N
                                 k=1
                                                                       dition as a baseline because it describes the case in which
3.2     Algorithm                                                      ethical preferences are not incorporated into the algorithm at
                                                                       all.
Each time a new patient-donor pair enters the exchange, we
add a corresponding vertex u to the graph, randomly sample             Condition 2: HOMOGENEOUS The HOMOGENEOUS algo-
a βu ∼ N (µ̂, Σ̂) from the learned distribution, and weight            rithm matches participants using the LP in Eq 3. It assigns
outgoing edges u → v using the resulting “BLP function”:               edge weights based on the BT-score of the recipient, relying
BLP (u, v) = Xp(v) >
                      βu + uv . In this way, we represent the         on the assumption that individual preferences are sufficiently
                                                                       homogeneous to be captured by a single static utility function.
full distribution of preferences. Note that the BLP function
                                                                       This is the algorithm proposed by Freedman et al.. See Ta-
indicates a random sample from the surveyed population’s
                                                                       ble 2 for the weights used in the EQUAL and HOMOGENEOUS
preference distribution – it does not represent the preferences
                                                                       conditions.
of donor u specifically. Letting wBLP (u,v) be the score that
vertex u’s sampled BLP function places on donating to the              Condition 3: HETEROGENEOUS The HETEROGENEOUS
patient in vertex v, we modify the LP in Eq 3 to be:                   algorithm matches participants using the LP in Eq 6. It sam-
                                                                       ples a BLP function when each vertex is added to the graph,
              P        hP                   i                          normalizes the scores produced by that function to the range
      max      c∈C(L)     (u,v)∈c wBLP (u,v) xc                        [0, 1], and uses that function and the profile of the recipient
                                                                       to weight each new outgoing edge. This allows for the pos-
              P
      s.t.    Pc:v∈c xc ≤ 1                             ∀v ∈ V
               c∈C(L) |c|xc ≥ Q
                                                                       sibility that heterogeneous individual preferences are better
                                                                 (6)   captured by a distribution than by a single utility function.
                                                                       This is the novel algorithm that we propose in this work.
3.3     Evaluation Metric
                                                                       4.2    Measures
We further define the rank of a donation u → v to be the po-
sition of v’s patient profile in the preference ordering induced       We evaluate each algorithm according to both the measure we
by u’s BLP function. For example, if the BLP function asso-            propose in Section 3 and the measure used by Freedman et al.
ciated with vertex u weights the profile of the patient in vertex      Average Rank The average rank of a matching is the aver-
v above all other profiles, rank(u, v) = 1. Conversely, if the         age rank of each donation in the matching, where rank(u, v)
BLP function weights the profile below the other seven pos-            of a donation u → v is as defined in Section 3.3. Recall
sible patient profiles, rank(u, v) = 8. In this context, rank          that lower ranks indicate higher preference satisfaction and,
           Profile ID   EQUAL      HOMOGENEOUS
                1        1.000           1.000
                2        1.000           0.103
                3        1.000           0.236
                4        1.000           0.036
                5        1.000           0.070
                6        1.000           0.012
                7        1.000           0.024
                8        1.000           0.003

Table 2: Patient profile weights for the EQUAL and HOMOGENEOUS
experimental conditions. The EQUAL algorithm values all profiles
equally, so all have weight 1.0. However, the HOMOGENEOUS
algorithm weights profiles according to their BT-scores. The
HETEROGENEOUS algorithm samples BLP functions throughout           Figure 2: Average rank of donations in each simulated run
matching and so does not have a static weight for each profile.    (N = 50). Lower ranks indicate more-preferred matches. The
                                                                   HETEROGENEOUS algorithm produces the lowest average ranks
                                                                   (median = 3.24), followed by the HOMOGENEOUS algorithm (me-
since there are 8 profiles, all possible ranks fall in the range   dian = 3.66), then the EQUAL algorithm (median = 4.06).
[1.0, 8.0]. For each run of each algorithm, we recorded the av-
erage rank of every matching in the simulation, then averaged
these to get the average rank value for that algorithm.            HOMOGENEOUS algorithm produces the next-lowest aver-
                                                                   age rank, followed by the EQUAL algorithm, which produces
Proportion Matched The proportion matched of a profile             the highest average rank. This is because the EQUAL algo-
is the proportion of patients of that type that entered the kid-   rithm weights all edges equally, matching recipients without
ney exchange pool and were subsequently matched. A pro-            any consideration of the personal characteristics used to de-
portion matched of 100% means that all patients of that type       fine their weight. The HOMOGENEOUS algorithm improves
were matched, and a proportion matched of 0% means that            upon this by considering the characteristics of donation re-
none were. For each run of each algorithm, we recorded the         cipients, but fails to approximate preferences as closely as
number of patients with each profile that entered the pool and     HETEROGENEOUS.
the number of patients of each profile that were eventually
matched, and used this to calculate proportion matched.            Proportion Matched Freedman et al. quantified the im-
                                                                   pact of their modified algorithm by comparing the propor-
4.3    Experimental Setup                                          tions of patients of each profile type matched by their al-
We built a simulator1 to mimic daily matching using the            gorithm against the proportions matched by the unmodified
EQUAL, HOMOGENEOUS, or HETEROGENEOUS algorithms                    algorithm, so for the sake of comparison we do the same.
based on previously developed tools [Dickerson and Sand-           The proportions of each profile matched by the EQUAL al-
holm, 2015; Freedman et al., 2020]. Each simulated “day”,          gorithm, the HOMOGENEOUS algorithm (proposed by Freed-
some pairs enter the pool, some pairs exit the pool, and then      man et al.) and the HETEROGENEOUS algorithm (proposed
the matching algorithm is run on the pairs that remain. The        in this work) are shown in Figure 3.
unmatched pairs remain in the pool to potentially be matched          Since it doesn’t take patient profiles into account, the
in the future. For each algorithm, we executed 50 runs of 5        EQUAL algorithm matched approximately the same percent-
years of simulated daily matching, and recorded the average        age of patients across all profiles. Since it prioritizes pa-
matching rank and profile proportions matched for each run.        tients solely based on profile, the HOMOGENEOUS algorithm
                                                                   matched the more popular profiles (1-3) more often and the
                                                                   less popular profiles (4-8) less. Notably, the HOMOGENEOUS
5     Results                                                      algorithm almost always matches patients of profile 1, indi-
Average Rank Since lower donation ranks indicate that              cating that a patient’s profile can be one of the major factors
the recipient is higher in the sampled preference ordering,        in determining whether they receive a kidney. However, the
we propose that algorithms that induce lower average ranks         HETEROGENEOUS algorithm prioritizes patients not directly
better satisfy population preferences. As expected, the pro-       based on their profile, but based on the sampled BLP func-
posed HETEROGENEOUS algorithm consistently produces                tion’s valuation of their profile. As a result, this algorithm
matchings with the lowest average rank (Figure 2). The             still tends to match more of the commonly-preferred profiles
                                                                   and fewer of the commonly-dispreferred ones, but sometimes
   1                                                               samples a BLP function from the tails of the distribution that
     All code for this paper can be found in the Variation pack-
age of github.com/RachelFreedman/KidneyExchange                    prioritizes patient profiles very differently.
                                                                        of weighting all potential kidney recipients equally, deciding
                                                                        how to prioritize them in an opaque and ad-hoc way [UNOS,
                                                                        2015], or prioritizing them based on a single static utility
                                                                        function [Freedman et al., 2020], we proposed learning a dis-
                                                                        tribution over surveyed preferences and then sampling from
                                                                        this distribution for dynamic weighting during matching. We
                                                                        furthermore proposed donation rank as a better measure of
                                                                        preference satisfaction. We implemented our proposed algo-
                                                                        rithm and compare it to predecessor algorithms on a kidney
                                                                        exchange simulation, finding that our algorithm better satis-
                                                                        fies survey participant preferences.
                                                                           Our model was estimated based on preference data elicited
                                                                        from MTurk survey participants, who are assuredly not rep-
                                                                        resentative of society in general. Future work should elicit
                                                                        preferences from a more representative sample, and perhaps
                                                                        privilege preferences expressed by domain experts and stake-
Figure 3: Proportion of patients of each profile matched in each sim-   holders such as doctors, policy-makers and kidney exchange
ulated run (N = 50). All algorithms match approximately 62%             participants. However, we believe that our sample was not
of patients overall, but the HOMOGENEOUS algorithm dispropor-           more heterogeneous than the US population as a whole, so
tionately matches profiles with higher BT-scores, the EQUAL algo-       we expect the challenge of heterogeneity and our method-
rithm matches all profiles approximately equally, and the propor-       ology to continue to be relevant for this expanded sample.
tions matched by the HETEROGENEOUS algorithm lie between the            Moreover, since even a representative sample of the general
other two.                                                              public would still lack relevant domain-specific knowledge
                                                                        about kidney transplants, future work should also investigate
                                                                        methodologies that allow domain experts to correct or mod-
   If the survey preferences had been completely homoge-
                                                                        erate the outcomes of this process.
neous, then the HETEROGENEOUS algorithm would have
                                                                           We hope that the challenges highlighted and methodology
produced the same results as the HOMOGENEOUS algo-
                                                                        prototyped in this work will suggest a roadmap for develop-
rithm. However, because preferences expressed in the sur-
                                                                        ing techniques for automating moral decision making in other
vey data sometimes differ from the utility function used for
                                                                        domains.
the HOMOGENEOUS algorithm, sometimes different matches
are made. For example, while most survey participants pre-
ferred patient profile 1 to all other profiles, some did not, so        Acknowledgements
the HETEROGENEOUS algorithm respects this heterogeneity                 We thank Yunhao Huang for implementing the BLP model,
by sometimes prioritizing matching other profiles over profile          the Duke Moral AI team for sharing the human subject data,
1. This difference in matching is a further indication that our         Dr. John Dickerson for building an earlier version of the kid-
proposed algorithm more faithfully represents the full distri-          ney exchange simulation, and Dr. Peter Eckersley for early
bution over preferences.                                                discussions of the idea.

6    Discussion                                                         References
Faithfully instantiating the collective values of groups with           [Abraham et al., 2007] David J Abraham, Avrim Blum, and
heterogeneous individual preferences is a frequent challenge               Tuomas Sandholm. Clearing algorithms for barter ex-
for real-world AI systems. For example, we commonly                        change markets: Enabling nationwide kidney exchanges.
use AI systems to allocate scarce resources – such as kid-                 In Proceedings of the 8th ACM conference on Electronic
ney donors, food donations [Lee et al., 2018] and interview                commerce, pages 295–304, 2007.
slots [Schumann et al., 2019] – amongst group members in                [Berry et al., 1995] Steven Berry, James Levinsohn, and
a way that we hope maximizes group welfare. Moreover,                      Ariel Pakes. Automobile prices in market equilibrium.
our roads may soon be populated with autonomous vehicles,                  Econometrica: Journal of the Econometric Society, pages
which will have to make moral tradeoffs – such determin-                   841–890, 1995.
ing who to sacrifice in unavoidable collisions [Bonnefon et
al., 2016] – based on the complex and often contradictory               [Biro and Cechlarova, 2007] Peter Biro and Katarina Cech-
moral frameworks of the communities in which they operate.                 larova. Inapproximability of the kidney exchange problem.
It is therefore vital that the AI community develop techniques             Information Processing Letters, 101(5):199–202, 2007.
for faithfully aggregating such heterogeneous preferences and           [Biró et al., 2017] Péter Biró, Lisa Burnapp, Bernadette
use them to develop socially beneficial AI systems.                        Haase, Aline Hemke, Rachel Johnson, Joris van de Klun-
    In this paper, we proposed, instantiated, and evaluated one            dert, and David Manlove. Kidney exchange practices
such technique for incorporating heterogeneous ethical pref-               in Europe, 2017. First Handbook of the COST Action
erences into a specific real-world AI system: an algorithm                 CA15210: European Network for Collaboration on Kid-
for matching patient-donor pairs in kidney exchange. Instead               ney Exchange Programmes.
[Bonnefon et al., 2016] Jean-François Bonnefon, Azim
   Shariff, and Iyad Rahwan.         The social dilemma of
   autonomous vehicles. Science, 352(6293):1573–1576,
   2016.
[Conitzer et al., 2015] Vince Conitzer, Markus Brill, and
   Rupert Freeman. Crowdsourcing societal tradeoffs. In
   Proceedings of the 2015 international conference on au-
   tonomous agents and multiagent systems, pages 1213–
   1217, 2015.
[Dickerson and Sandholm, 2015] John P Dickerson and
   Tuomas Sandholm. Futurematch: Combining human
   value judgments and machine learning to match in dy-
   namic environments. In Twenty-Ninth AAAI Conference
   on Artificial Intelligence, 2015.
[Freedman et al., 2020] Rachel Freedman, Jana Schaich
   Borg, Walter Sinnott-Armstrong, John P Dickerson, and
   Vincent Conitzer. Adapting a kidney exchange algorithm
   to align with human values. Artificial Intelligence, page
   103261, 2020.
[Lee et al., 2018] Min Kyung Lee, Daniel Kusbit, Anson
   Kahng, Ji Tae Kim, Xinran Yuan, Allissa Chan, Ritesh
   Noothigattu, Daniel See, Siheon Lee, Christos-Alexandros
   Psomas, et al. Webuildai: Participatory framework for fair
   and efficient algorithmic governance. Preprint, 2018.
[Manlove and O’Malley, 2015] David Manlove and Gregg
   O’Malley. Paired and altruistic kidney donation in the UK:
   Algorithms and experimentation. ACM Journal of Experi-
   mental Algorithmics, 19(1), 2015.
[Nevo, 2001] Aviv Nevo. Measuring market power in the
   ready-to-eat cereal industry. Econometrica, 69(2):307–
   342, 2001.
[Roth et al., 2004] Alvin E Roth, Tayfun Sönmez, and
   M Utku Ünver. Kidney exchange. The Quarterly journal
   of economics, 119(2):457–488, 2004.
[Roth et al., 2005] Alvin E Roth, Tayfun Sönmez, et al. A
   kidney exchange clearinghouse in new england. American
   Economic Review, 95(2):376–380, 2005.
[Schumann et al., 2019] Candice Schumann, Zhi Lang, Jef-
   frey Foster, and John Dickerson. Making the cut: A
   bandit-based approach to tiered interviewing. In Advances
   in Neural Information Processing Systems, pages 4641–
   4651, 2019.
[UNOS, 2015] UNOS. Revising kidney paired donation pi-
   lot program priority points, 2015. OPTN/UNOS Public
   Comment Proposal.
[Zhang and Conitzer, 2019] Hanrui Zhang and Vincent
   Conitzer. A pac framework for aggregating agents’
   judgments. In Proceedings of the AAAI Conference on
   Artificial Intelligence, volume 33, pages 2237–2244,
   2019.