Design Methods for Artificial Intelligence
Fairness and Transparency
Simone Stumpf (a), Lorenzo Strappelli (a), Subeida Ahmed (a), Yuri Nakao (b), Aisha Naseer (c), Giulia Del Gamba (d) and Daniele Regoli (d)

(a) City, University of London, Northampton Square, London, UK
(b) Fujitsu Laboratories Ltd., Kawasaki, Japan
(c) Fujitsu Laboratories of Europe, Hayes, UK
(d) Intesa Sanpaolo S.p.A., Turin, Italy



Abstract
Fairness and transparency in artificial intelligence (AI) continue to become more prevalent as topics for research, design and development. General principles and guidelines for designing ethical and responsible AI systems have been proposed, yet there is a lack of design methods for these kinds of systems. In this paper, we present CoFAIR, a novel method to design user interfaces for exploring fairness, consisting of a series of co-design workshops and a wider evaluation. This method can be readily applied in practice by researchers, designers and developers to create responsible and ethical AI systems.

Keywords
fairness, transparency, explanations, design, methods


Joint Proceedings of the ACM IUI 2021 Workshops, April 13–17, 2021, College Station, USA
Simone.Stumpf.1@city.ac.uk (S. Stumpf); Lorenzo.Strappelli@city.ac.uk (L. Strappelli); Subeida.Ahmed@city.ac.uk (S. Ahmed); nakao.yuri@fujitsu.com (Y. Nakao); Aisha.Naseer@uk.fujitsu.com (A. Naseer); giulia.delgamba@intesasanpaolo.com (G.D. Gamba); daniele.regoli@intesasanpaolo.com (D. Regoli)
ORCID: 0000-0001-6482-1973 (S. Stumpf); 0000-0002-6813-9952 (Y. Nakao); 0000-0003-2711-8343 (D. Regoli)
© 2021 Copyright for this paper by its authors. Use permitted under Creative Commons License Attribution 4.0 International (CC BY 4.0). CEUR Workshop Proceedings (CEUR-WS.org), ISSN 1613-0073.

1. Introduction

There has been extraordinary interest in making artificial intelligence (AI) systems ethical and responsible over the last decade [1, 2]. Many principles and guidelines have been proposed to ensure considerations for fairness, accountability, and transparency are made in the design and development of these systems [3]. However, these guidelines are fairly abstract and do not lend themselves to guiding how and what to design. Recent work [4] has started to investigate design patterns to guide detailed user interface design. Overarching design methods for designing transparent AI systems, beyond the User-Centred Design (UCD) process, have also been proposed [5].

In this paper, we review existing work on design methods to guide designers of responsible and ethical AI systems and user interfaces. We then present a new method, Co-designing Fair AI InteRactions (CoFAIR), which consists of a series of co-design workshops followed by a broader evaluation, to create suitable user interfaces that lend themselves to exploring fairness by targeted user groups. We show the application of this method through a case study. We discuss the limitations of our approach, and how this method might be generalised to designing for ethical and responsible AI systems.
2. Related Work

It has been realised that artificial intelligence and machine learning pose unique design challenges that merit new design practices [6, 7, 8, 9]. In the last few years, a number of approaches have been suggested to ease the design and development of responsible and ethical AI systems. Here, we present an overview of guidelines for designing ethical AI systems, before turning to work that aims to address design patterns and methods.

2.1. Design Guidelines

Considerable thought has been given to providing guidelines for designing and developing these ethical AI systems. The most well-known of these have been developed by Microsoft, Google and IBM, with some efforts also being produced by the High-Level Expert Group (HLEG) on AI set up by the European Commission. We will briefly review these efforts, but see [3] for a comprehensive survey of AI ethics guidelines.

Microsoft's Guidelines for Human-AI Interaction [10], part of their Responsible AI area, are implemented as a set of eighteen cards. Each card describes a guideline and some examples of how that guideline might apply in practice, over four stages of use: 'initially', 'during interaction', 'when wrong', and 'over time'. These guidelines provide designers and developers with high-level considerations to make during the design process. For example, guideline 6 prompts to "mitigate social biases" during interaction by ensuring that "the AI system's language and behaviors do not reinforce undesirable and unfair stereotypes and biases." Guideline 11 is to "make clear why the system did what it did" when wrong and suggests to "enable the user to access an explanation of why the AI system behaved as it did". While each comes with an example of how this might be realised in practice, it is up to the designer or developer to craft appropriate ways to implement the guideline.

Google's Responsible AI practices [11, 12] suggest that ethical AI systems should be designed following best practices for software systems, supplemented with considerations specific to machine learning. Overall, a human-centered design approach should be followed to actively consider fairness, interpretability, privacy and security from the outset. Specific advice for designing the user experience for AI systems has been given by the People + AI Guidebook [12], such as identifying user needs and their mental models, or addressing explainability and trust. While these guidelines do not explicitly surface fairness as a specific consideration, it is covered when collecting and evaluating data and also in communicating with users.

IBM's Everyday Ethics for Artificial Intelligence [13, 14] suggests five areas to focus on in the development of ethical AI systems: accountability, value alignment, explainability, fairness and user data rights. The guidelines present a rationale for why these aspects require attention, make recommendations for actions to take and questions the design team should consider, and provide examples of implementations.

The HLEG's ethics guidelines for trustworthy AI [15] set out a framework of ethical principles and associated requirements that should be covered in AI development. In applying this framework, the report suggests adopting both technical and non-technical methods, such as transparency-by-design or inclusive design teams. In order to assess that AI has been developed in accordance with these principles and requirements, the report also puts forward a checklist to be used within design practices.

While guidelines for developing responsible and ethical AI have some use in stimulating discussions within design teams about high-level concepts and requirements that need to be met, as noted previously [16], these guidelines are fairly abstract and are difficult for designers and developers to implement in practice.
2.2. Design Patterns

Currently, there is a lack of design patterns for AI systems that tell designers and developers what to design. In HCI and data visualisation, design patterns for common use cases and scenarios on well-studied technologies are readily available (see, e.g., http://ui-patterns.com/). These tell designers and developers how to support interactions and communications through a user interface. Similarly, there has been a line of research in Explainable AI (XAI) that aims to establish what information to communicate and what interactions to support in order to make a system transparent. High-level principles for explainability and controllability have been proposed [17], such as 'be sound', 'be complete', 'be actionable', and 'be reversible'.

In addition, there is an emerging body of research that aims to investigate what is most effective in terms of user interfaces that provide explanations. A lot of work has focused on what information should be available to users and how this information should be communicated via text, graphics or visualizations [18, 19, 20, 21, 14]. A recent effort to start developing design patterns [4], backed by cognitive psychology, has suggested links (or patterns) between how people should reason, how people actually reason, and how to generate explanations that support reasoning.

2.3. Design Methods

There has been only scarce consideration of design methods that tell designers and developers how to design ethical and responsible AI systems using a structured process. At the moment, most of the guidelines mentioned in section 2.1 suggest adopting a User-Centred Design (UCD) process involving user research, designing and prototyping, and evaluating, using techniques such as interviews, observations, and user testing. Yet, given that many have argued that AI system design poses significant challenges [6, 7, 8], there is still a dearth of work that addresses design methods to guide designers and developers in developing responsible AI.

Very recently, design methods have been proposed that focus on designing AI algorithms with users. WeBuildAI [22] proposes a framework of steps that involves users in designing algorithms. This method proceeds by investigating feature engineering and selection through surveys and interviews, model building through pair-wise comparisons by users, and finally model selection through exposing the model decisions.

The most well-known attempt to establish a design method for ethical AI user interfaces is transparency design [5]. This work proposes a stage-based process to first investigate the mental models of experts and then of users to establish a target mental model of what needs to be explained, before iteratively prototyping the user interface to establish how to communicate the explanations and then evaluating it. To develop the mental models of experts, interviews and workshops are suggested, while to investigate users' mental models it is suggested to employ surveys, interviews, task-based studies and drawing tasks. For developing the target model, card sorting, interviews and focus groups were proposed. Designing and evaluating the user interfaces can involve focus groups, workshops, and think-aloud studies. There are now several case studies that have used this process to successfully implement explanations in AI interfaces [23, 24, 25].

Our work is concerned with investigating design methods for user interfaces that can help with making the fairness of AI algorithms transparent, and then help with mitigating fairness issues by incorporating user feedback back into the algorithm.
Figure 1: The CoFAIR process



3. The CoFAIR method

We present here our method to Co-design Fair AI InteRactions, CoFAIR (Fig. 1). This method is based on a co-design process [26] which aims to work closely with users to develop solutions through a participatory design approach. Like other co-design approaches, it is characterised by very close involvement of a small number of users in all stages of designing a solution, in which these users are empowered to be on an equal footing with researchers and designers. Co-design has been successfully adopted to design human-centred technology in other settings [27, 28]; however, how to use co-design in shaping AI solutions has not yet been investigated.

Our proposed method for responsible AI includes a series of co-design workshops with participants/co-designers that focus on user research, conceptual and detailed design, and initial testing, which is then broadened in a final user evaluation stage.

3.1. Co-design workshops

To start, CoFAIR comprises a series of workshops to work closely with a limited number of participants to research the topic area, to develop some designs, and then to test those designs. To set up these workshops, a number of considerations will need to be made:

3.1.1. Recruitment

Participants in a workshop should be the targeted users of an AI system. The aim is to closely involve these participants in designing a solution that is right for them, and to align the design with their requirements. If there are a number of different user groups that are distinct in their background, use cases or tasks, then separate workshops should be organised for them. The users will not need to have a detailed technical understanding or any experience with system design or development, as they will be supported by researchers, designers, and developers. Ideally, they should be relatively representative of the user group in terms of background and demographics. For each workshop, the number of participants should be kept low, between 3-6 people, so as to encourage interaction between participants.
3.1.2. Workshop Aims and Structure

We suggest that the workshops aim to cover three main steps in user-centred design: user research, conceptual and detailed design, and testing. User research in these workshops should investigate the users' current conceptualisations and experience within the topic area, pain points, and high-level needs and wants. This user research can be formalised and communicated through co-created personas that reflect the target user group [27, 29] or could be more informal, as simple lists of requirements. Conceptual and detailed design will involve the participants in surfacing what information and interactions are needed to achieve their tasks while also clarifying how to present this in the user interface. This might be documented in storyboards, user journeys, and sketches, or produce scenario-based object-action analyses. Last, these designs should be prototyped, either using low-fidelity paper prototypes or more high-fidelity clickable wireframes, and then tested with participants.

Depending on the complexity of what is to be designed, these steps need to be spread over a series of workshops. Most naturally, these steps suggest three sequential workshops, each with a distinct focus on user research, design, and testing. It might be possible to combine user research and design, and thus reduce the number of workshops to two. However, more iterations might be needed to explore design options and iterations of prototypes, and thus more workshops might need to be scheduled. Our method is flexible enough to accommodate this.

3.1.3. Workshop Activities

To achieve the aims of the workshops, co-design usually proceeds with group-based, hands-on activities and discussion around these activities. For user research, these could involve real or hypothetical scenarios and the user experiences around the topic. Activities would typically explore problematic aspects and the challenges that users face in carrying out a user task. They would also probe for basic understandings and conceptualisations around the topic of investigation. These can be (but don't have to be) documented in personas, and recent work has shown how these personas can be co-created with co-design participants [27, 29].

Activities that aim to support design are also kept very concrete. Typically this would investigate a scenario of use, either real or fictitious. As part of conceptual design, participants would usually be invited to go through the scenario of use and indicate what they would look for, what interactions they would expect, and what information the system would need to communicate. It is sometimes helpful to develop storyboards or user journeys with co-design participants. Detailed design can flesh out design options through sketches; however, this needs to be carefully supported and scaffolded as participants are often too timid to sketch themselves.

In testing, a prototype, often created or refined by a designer/developer offline, is exposed to evaluation by co-design participants. Again, a real or fictitious scenario is used to explore how the prototype might be used and what improvements are necessary for a subsequent iteration.

3.2. Broader Evaluation

A common criticism of co-design is the limited number of participants that are involved in developing a solution. This leads to the fear that while the solution is optimally adapted to the 3-6 participants in the workshops, it is unsuitable for the wider user population. Our method suggests that co-design is always followed by broader evaluation of the designed system through evaluations with users.
This can take various forms, such as think-aloud user testing or large-scale crowd-sourced system use.

4. Method Case Study: Loan Application Fairness

In order to show how our method can be instantiated in practice, we present a case study in which we investigated how to develop user interfaces that allow users to explore the fairness of AI loan application decisions. Loan application decisions are increasingly being automated or supported using AI models (typically, employing logistic regression). This study targeted three different user groups in two iterations: non-expert members of the public (iteration 1), and loan officers and data scientists (iteration 2). Iteration 1 details how we instantiated the method with non-expert customers, while iteration 2 is concerned with the method used with loan officers and data scientists. We focus on the techniques employed in our method; we will report on the findings of these studies elsewhere.

4.1. Iteration 1: Non-expert Members of the Public

We ran a series of co-design workshops with a total of 12 participants in the USA, UK, and Japan. Because of COVID-19 restrictions we had to change our planned face-to-face workshops to be conducted entirely online.

4.1.1. Co-design Participants

We recruited 3 participants (2 women, 1 man, mean age 47.3) for the co-design workshops held in the USA, 5 participants (3 women, 2 men, mean age 34.2) in the UK, and 4 participants (3 women, 1 man, mean age 33.75) in Japan through social media and personal contacts. All participants' ethnicities broadly reflected the population of the country, and most participants had been educated to a Bachelor degree level. We paid an incentive of £40, or the equivalent in the local currency.

4.1.2. Workshop Procedure

For each country, we held 2 co-design workshops; these two workshops were 3 weeks apart. Both workshops lasted 2 hours.

In workshop 1, we conducted user research and conceptual design. For the user research part, we investigated how participants defined fairness, and then how they explored fairness in loan decisions. For investigating how participants viewed AI fairness, we first got participants to tell us about their own experiences of fair or unfair decisions that affected them, especially if they encountered AI in that decision-making. We then also probed them to consider the fairness of using AI systems in hiring or making medical decisions and what makes AI systems fair or unfair.

To continue user research and start on conceptual design, we constructed an activity involving four fictitious loan application scenarios (Fig. 2). This allowed us to further investigate what attributes and information they were looking for to assess the fairness of the applications' outcomes and potentially what they would change to make the decisions fairer. Each scenario was discussed in turn: whether it was fair, why (based on the information included in the application or their experience of the decisions they had seen), and what information would have been useful for them to assess fairness better. We changed some of the application scenario details to localize them to each country (e.g. names, currency, dates) but otherwise kept them the same. We showed participants information that is usually collected as part of a loan application process, based on the application form of a well-known international bank.
Figure 2: (A) Application 1: Mark Benson, (B) Application 2: Sadia Mohammed, (C) Application 3: Jennifer Clary, and (D) Application 4: Kwame Odejima.
Application 1 (USA/UK: Mark Benson or Kazufumi Takahashi) was always approved, as it was a 'safe' application, with a homeowner with a very good credit score applying for a small loan to buy a used car. Application 2 (USA/UK: Sadia Mohammed or Chihe Pak) was rejected, as it was a more 'risky' application with low income, a part-time job and a low credit score. We also included her application to investigate any potential minority or age biases. Application 3 (USA/UK: Jennifer Clary or Maika Suzuki) was also rejected, but crucially her details were very similar to Mark Benson's. This was to introduce an application that seemed, without any further information, to be blatantly unfair. Finally, Application 4 (USA/UK: Kwame Odejima or Dũng Nguyên) was accepted although it seemed more 'risky'.

After the workshop, two researchers reviewed the workshop recordings and analysed the participants' definitions of AI fairness and how they thought AI could be made fairer. For each scenario, we analysed what criteria they used to assess fairness, how they were using information to explore fairness, and what other information they wanted to be able to assess whether a loan application decision was fair, or potentially biased. Based on this analysis, we constructed clickable wireframes to instantiate their input in an interface. We did this by carefully mapping information that they used for fairness assessments, and requests for further information obtained in workshop 1, to interface design elements; we did not involve participants in detailed design activities.

In workshop 2, we moved on to a testing activity. We structured our discussion on the clickable wireframes, and developed some scenarios to explore fairness using the clickable prototype. Going through each screen's functionality, we discussed what helped to understand whether the application decisions were fair, what additional information they would like in order to determine fairness, and what feedback they would like to give to mitigate fairness issues.

4.1.3. Broader Evaluation

Following the co-design workshops, we implemented an improved interface. We then set up an online study to investigate how this prototype is employed by end-users to assess the fairness of an AI system, and how suggested changes to the model affect fairness.

We recruited 388 participants (129 female, 256 male, 2 other and 1 preferred not to say) through Prolific (https://www.prolific.co/), an online research platform, and paid them £3.50 for an expected 30-minute session. About half of our participants had some programming experience and familiarity with AI, machine learning or statistics, and 146 participants had at least a Bachelor degree.
We asked participants to interact with the interface to assess the fairness of an AI system. Instead of using an open-source dataset, the AI system we developed was based on an anonymized loan decisions dataset we obtained from Intesa Sanpaolo. This dataset contains decisions made on 1000 loan applications and has 35 attributes, including the label of whether the loan application was accepted or rejected. These attributes include demographic information about the applicant (age, gender, nationality, etc.), financial information (household income, insurance, etc.), loan information (amount of loan requested, purpose of loan, loan duration, monthly payments, etc.), as well as some information about their financial and banking history (years of service with the bank, etc.). There were also some attributes that related to internal bank procedures, such as a money laundering check and a credit score developed by the bank. We developed a logistic regression model after removing sparse values, or where multiple attributes had similar values; the accuracy of the resulting model was 0.618. Note that the model was unfair with respect to the nationality attribute: 'foreign' applicants tended to be rejected more frequently than citizens, using disparate impact as a fairness metric.
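As an illustration of the kind of analysis behind this observation, the sketch below trains a logistic regression classifier on loan decisions and computes the disparate impact ratio between 'foreign' and citizen applicants. This is a minimal sketch only: the Intesa Sanpaolo data are not public, so the file name and the column names (accepted, nationality) are hypothetical stand-ins, and the paper does not state that scikit-learn or this exact preprocessing was used.

# Hypothetical illustration of training a loan decision model and measuring
# disparate impact; the file and column names are stand-ins, not the actual dataset.
import pandas as pd
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

df = pd.read_csv("loan_decisions.csv")       # hypothetical export: 1000 applications, 35 attributes
y = df["accepted"]                           # 1 = approved, 0 = rejected
X = pd.get_dummies(df.drop(columns=["accepted"]), drop_first=True)

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=0)
model = LogisticRegression(max_iter=1000).fit(X_train, y_train)
print("accuracy:", model.score(X_test, y_test))

# Disparate impact: ratio of predicted acceptance rates for the unprivileged
# group ('foreign' applicants) over the privileged group (citizens); values
# well below 1 (commonly below 0.8) are read as evidence of adverse impact.
test = df.loc[X_test.index]
pred = pd.Series(model.predict(X_test), index=X_test.index)
rate_foreign = pred[test["nationality"] == "foreign"].mean()
rate_citizen = pred[test["nationality"] == "citizen"].mean()
print("disparate impact:", rate_foreign / rate_citizen)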
   The evaluation consisted of a brief pre-ques- workshops, each lasting 2 hours. Both work-
tionnaire and tutorial, 20 minutes of free use shops were repeated for each separate stake-
of the interface to assess fairness, and a post- holder group.
questionnaire. To evaluate the use of this pro-     As with the previous iteration, the aim of
totype we captured participants’ ratings of workshop 1 was to conduct user research into
the AI fairness and key interactions with the how fairness was perceived by these user
user interface where logged. We also asked groups, and to carry out initial conceptual de-
them to describe in their own words what sign. Workshop 1 started off by discussing
strategies they used to assess the fairness of the aspects that make decisions in loan ap-
the system, any systematic fairness issues they plications fair or unfair to get an insight on
had spotted, and their views on suggesting participants’ loan application experience and
changes and addressing fairness. We then fin- unfair scenarios that they may have come up
ished the study by asking them to rate their against. This was followed by how AI could
task load using the NASA-TLX questionnaire
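The interaction logging itself can be kept very lightweight; the sketch below shows one possible way of recording and summarising such events. The event names and schema are hypothetical and are not the study's actual log format.

# Hypothetical interaction log for the fairness-exploration prototype:
# each UI event is appended as one JSON line and later aggregated.
import json, time
from collections import Counter

LOG_PATH = "interaction_log.jsonl"

def log_event(participant_id: str, component: str, action: str) -> None:
    # Append one timestamped UI event, e.g. opening the attribute panel.
    event = {"t": time.time(), "participant": participant_id,
             "component": component, "action": action}
    with open(LOG_PATH, "a") as f:
        f.write(json.dumps(event) + "\n")

def summarise(path: str = LOG_PATH) -> Counter:
    # Count how often each UI component was used across all participants.
    with open(path) as f:
        return Counter(json.loads(line)["component"] for line in f)

log_event("P001", "attribute_panel", "select_nationality")
log_event("P001", "fairness_rating", "rate_2_of_5")
print(summarise())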
On study completion, we analysed the interactions with the prototype to evaluate whether this prototype was effective in supporting users in exploring the fairness of an AI model.

4.2. Iteration 2: Loan Officers and Data Scientists

4.2.1. Co-design Participants

This iteration was focused on exploring how to support loan officers and data scientists in exploring the fairness of loan application decisions. These two stakeholder groups are different: loan officers typically act as intermediaries between the bank and customers and had practical experience of loan decision making, while data scientists have experience in modelling and supporting and/or investigating customer application decisions. For this study, we recruited six loan officers (5 men, 1 woman, mean age 36.5) and six data scientists (3 men, 3 women, mean age 29.7) through Intesa Sanpaolo.

4.2.2. Workshop Procedure

Due to COVID-19 and logistical limitations, all interactions with the users were conducted online. We structured the activities into two workshops, each lasting 2 hours. Both workshops were repeated for each separate stakeholder group.

As with the previous iteration, the aim of workshop 1 was to conduct user research into how fairness was perceived by these user groups, and to carry out initial conceptual design. Workshop 1 started off by discussing the aspects that make decisions in loan applications fair or unfair, to get an insight into participants' loan application experience and unfair scenarios that they may have come up against. This was followed by discussing how AI could impact loan application decision-making and fairness.

To further our user research and also understand what key information is important to use in conceptual design, we then introduced an activity to explore the anonymized loan decisions dataset we obtained from Intesa Sanpaolo. The dataset was sent ahead of the workshop so that participants could have time to look at it and have it available on their computers during the session. The discussion elicited information on participants' process, information needs, and the functionality required to develop an interface. To help participants investigate the dataset, a data visualisation tool was created, which was used to present the dataset should participants require it. It provided the ability to slice the features on the fly and present them using various chart types such as histograms, scatter plots, bar graphs and a strip plot.
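A minimal sketch of this kind of on-the-fly slicing is given below, using pandas and matplotlib; the paper does not describe the tool's actual implementation, and the file and column names are hypothetical.

# Hypothetical feature-slicing helper: plot one attribute's distribution,
# split by loan decision, so a group can eyeball differences between slices.
import pandas as pd
import matplotlib.pyplot as plt

df = pd.read_csv("loan_decisions.csv")

def slice_and_plot(data: pd.DataFrame, feature: str, by: str = "accepted") -> None:
    fig, ax = plt.subplots()
    for value, group in data.groupby(by):
        group[feature].plot(kind="hist", alpha=0.5, ax=ax, label=f"{by}={value}")
    ax.set_xlabel(feature)
    ax.legend()
    plt.show()

slice_and_plot(df, "household_income")   # e.g. income distribution by outcome
slice_and_plot(df, "loan_amount")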
Next, we introduced an activity to reflect on a causal graph showing causal relationships between the dataset attributes. This causal graph was derived through automatic discovery, showing how attribute values and the loan application decisions are related to each other. Through this activity we aimed to understand how these users might interpret the causal graph and how this might be employed in exploring the dataset for fairness.
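The paper does not name the discovery algorithm or library that produced this graph, so the sketch below only illustrates how such a graph over a few (hypothetical) attributes could be represented and drawn once discovered; the edges are hand-written placeholders rather than actually discovered relationships.

# Illustrative rendering of a causal graph over hypothetical loan attributes;
# in practice the edge list would come from an automatic discovery step.
import networkx as nx
import matplotlib.pyplot as plt

g = nx.DiGraph()
g.add_edges_from([
    ("citizenship", "credit_risk_level"),
    ("household_income", "credit_risk_level"),
    ("credit_risk_level", "decision"),
    ("loan_amount", "decision"),
])

nx.draw_networkx(g, pos=nx.spring_layout(g, seed=1), node_color="lightgrey", arrows=True)
plt.axis("off")
plt.show()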
After the first workshops, a researcher analysed the audio recordings to derive findings about how these user groups judged whether loan applications were fair, how these users explored the dataset to determine fairness, and how they interpreted the causal graph. Based on this analysis the researcher developed a clickable wireframe to be used in workshop 2 (Fig. 3). Again, we did not involve the users in detailed design. Due to implementation constraints, we only made a selection of the wireframe interactive, and focused on a scenario in which to explore the relationships between citizenship, gender, credit risk level, loan amount and number of instalments in detail.

The aim of workshop 2 was to informally test the clickable wireframe. This wireframe was screen shared and the researcher 'drove' the interactions with it, acting as an extension of the participants and clicking through it on their behalf. The researcher stepped through it with the respective user groups, and probed whether they understood how it worked, whether the information was useful for exploring fairness, or what could be improved.

Analysis of the second workshop investigated changes that needed to be made to improve the clickable prototype for broader evaluation. Based on this analysis the researchers designed a prototype (Fig. 4).

4.2.3. Broader Evaluation

The evaluations were conducted as one-to-one user tests, unlike the workshops in the previous phase. A total of 17 participants were recruited through Intesa Sanpaolo: 8 loan officers (5 men, 3 women, mean age 38) and 9 data scientists (5 men, 4 women, mean age 31.8). All participants held a master's degree or higher.

We developed ten tasks for participants to go through the prototype, from setting up the dataset to explore, to investigating the dataset using different components of the user interface. The study concluded with a post-questionnaire used to evaluate users' experience. This questionnaire comprised ratings aimed at quantifying how effective the prototype was in supporting users in assessing fairness, including information, functionality and reasoning; free comments to express their feedback about the prototype; and the NASA-TLX questionnaire [30].

The broader evaluation was analysed as to what worked well and what did not, in order to develop functioning interfaces in the future.
Figure 3: The clickable wireframe used in Workshop 2. (A) System overview. (B) Attribute information
including name, value distribution, fairness metric and weight. (C) Causal graph of selected attributes.
(D) Dataset. (E) Comparison of currently selected application in dataset and all other applications with
respect to similarity and application outcome.



5. Discussion

We have gained some experience from applying co-design in other application domains, and through a case study where we implemented the CoFAIR method to develop interfaces for exploring fairness. This showed that this method can be successfully employed to design interfaces for responsible AI systems. However, we encourage other researchers and practitioners to adopt this method and generate more data points to improve this approach, and also to validate it. In addition, CoFAIR was so far employed under COVID-19 restrictions, which meant that all workshop activities and testing had to be conducted remotely online, and this impacted what we were able to do. If we had not been placed in this situation, we would have made different choices as to how to conduct the workshops. First, due to the online nature we shortened the co-design activities and compressed them into two workshops of two hours each. Ideally we would like to extend them to span three workshops and run for a longer duration. Second, facilitation of online discussions is very difficult, and ideally we would have brought users together to discuss this more freely face-to-face. Last, we would have liked to involve users much more in conceptual and detailed design, for example through sketching or paper prototyping, but this is very difficult to do virtually.

We can also note some general limitations of the CoFAIR method which should be considered before it is chosen as a design approach. First, as with all co-design, there is a danger that interfaces are developed that only fit the small number of people that were involved as users in the workshops. This can be alleviated through conducting broader evaluations that ensure that the designs are fit for purpose.
Figure 4: The evaluation prototype. (A) Causal graph of selected attributes. (B) Attribute information including fairness metrics, and value distributions with respect to application decision. (C) System overview including total acceptance. (D) Dataset view, highlighting currently selected attribute. (E) Attribute combination view, showing subset information. (F) Comparison of similarity of the currently selected application with other applications.



Second, it is not a 'discount' methodology that is fast and easy to apply. Implementing it requires several lengthy workshops with users to be organised, separated in time so that researchers and designers can analyse and produce new materials for subsequent activities. This means that even relatively small projects can spread over several months, from initial recruitment of users to a fully refined and evaluated interface. Because we want to guard against 'overfitting' designs to small numbers of participants, it is not advisable to cut this process short and skip the broader evaluation to save time. Last, this method focuses very much on the mental model of users and does not account for the input of 'experts' or consider how people should reason. Hence, it is possible that we might build the biases that users hold back into these interfaces, and only support current ways of working. How to successfully mitigate fairness issues, especially through a human-in-the-loop approach, is still an open research question.

We believe that our method is another step towards strengthening the design of responsible and ethical AI. A major advantage of CoFAIR is that it produces designs and interfaces that focus heavily on what specific target users need and want. It thus produces 'shrink-wrapped' interfaces that should be eminently suitable for communicating with a specific user group. Taken together, this method could be easily extended to investigate what and how to explain machine learning systems, in order to design more responsible and ethical AI systems.
6. Conclusion

In this paper, we outlined that practical design methods that translate general guidelines into concrete processes to follow are in short supply. We presented the CoFAIR method to design responsible AI: co-design workshops that focus on user research, conceptual and detailed design, and initial testing are followed by broader evaluation. We showed how we implemented this method through a case study which focused on supporting non-expert 'end-users', loan officers, and data scientists to explore fairness in loan application decisions. We discussed the considerations that need to be made when choosing this method. We believe that other researchers, designers and practitioners of responsible AI systems can adopt this approach to develop suitable interfaces.

References

[1] M. K. Lee, Understanding perception of algorithmic decisions: Fairness, trust, and emotion in response to algorithmic management, Big Data & Society 5 (2018). doi:10.1177/2053951718756684.
[2] M. Veale, M. Van Kleek, R. Binns, Fairness and accountability design needs for algorithmic support in high-stakes public sector decision-making, in: Proceedings of the 2018 CHI Conference on Human Factors in Computing Systems, CHI '18, ACM, New York, NY, USA, 2018, pp. 1–14. doi:10.1145/3173574.3174014.
[3] A. Jobin, M. Ienca, E. Vayena, The global landscape of AI ethics guidelines, Nature Machine Intelligence 1 (2019) 389–399. doi:10.1038/s42256-019-0088-2.
[4] D. Wang, Q. Yang, A. Abdul, B. Y. Lim, Designing Theory-Driven User-Centric Explainable AI, in: Proceedings of the 2019 CHI Conference on Human Factors in Computing Systems, CHI '19, ACM, New York, NY, USA, 2019, pp. 601:1–601:15. doi:10.1145/3290605.3300831.
[5] M. Eiband, H. Schneider, M. Bilandzic, J. Fazekas-Con, M. Haug, H. Hussmann, Bringing Transparency Design into Practice, in: 23rd International Conference on Intelligent User Interfaces, IUI '18, ACM, New York, NY, USA, 2018, pp. 211–223. doi:10.1145/3172944.3172961.
[6] G. Dove, K. Halskov, J. Forlizzi, J. Zimmerman, UX design innovation: Challenges for working with machine learning as a design material, in: Proceedings of the 2017 CHI Conference on Human Factors in Computing Systems, CHI '17, ACM, New York, NY, USA, 2017, pp. 278–288. doi:10.1145/3025453.3025739.
[7] Q. Yang, A. Steinfeld, C. Rosé, J. Zimmerman, Re-examining whether, why, and how human-AI interaction is uniquely difficult to design, in: Proceedings of the 2020 CHI Conference on Human Factors in Computing Systems, CHI '20, ACM, New York, NY, USA, 2020, pp. 1–13. doi:10.1145/3313831.3376301.
[8] Q. Yang, A. Steinfeld, C. Rosé, J. Zimmerman, Re-examining whether, why, and how human-AI interaction is uniquely difficult to design, in: Proceedings of the 2020 CHI Conference on Human Factors in Computing Systems, CHI '20, ACM, New York, NY, USA, 2020, pp. 1–13. doi:10.1145/3313831.3376301.
[9] L. E. Holmquist, Intelligence on Tap: Artificial Intelligence as a New Design Material, Interactions 24 (2017) 28–33. doi:10.1145/3085571.
[10] S. Amershi, D. Weld, M. Vorvoreanu, A. Fourney, B. Nushi, P. Collisson, J. Suh, S. Iqbal, P. N. Bennett, K. Inkpen, J. Teevan, R. Kikin-Gil, E. Horvitz, Guidelines for Human-AI Interaction, in: Proceedings of the 2019 CHI Conference on Human Factors in Computing Systems, CHI '19, ACM, New York, NY, USA, 2019, pp. 1–13. doi:10.1145/3290605.3300233.
[11] M. Mitchell, S. Wu, A. Zaldivar, P. Barnes, L. Vasserman, B. Hutchinson, E. Spitzer, I. D. Raji, T. Gebru, Model Cards for Model Reporting, in: Proceedings of the Conference on Fairness, Accountability, and Transparency, FAT* '19, ACM, New York, NY, USA, 2019, pp. 220–229. doi:10.1145/3287560.3287596.
[12] People + AI Guidebook, 2019. URL: https://pair.withgoogle.com/guidebook.
[13] F. Rossi, A. Sekaran, J. Spohrer, R. Caruthers, Everyday Ethics for Artificial Intelligence, 2019. URL: https://www.ibm.com/watson/assets/duo/pdf/everydayethics.pdf.
[14] Q. V. Liao, D. Gruen, S. Miller, Questioning the AI: Informing Design Practices for Explainable AI User Experiences, in: Proceedings of the 2020 CHI Conference on Human Factors in Computing Systems, CHI '20, ACM, New York, NY, USA, 2020, pp. 1–15. doi:10.1145/3313831.3376590.
[15] High-Level Expert Group on Artificial Intelligence, Ethics Guidelines for Trustworthy AI, 2019. URL: https://ec.europa.eu/digital-single-market/en/news/ethics-guidelines-trustworthy-ai.
[16] M. A. Madaio, L. Stark, J. Wortman Vaughan, H. Wallach, Co-Designing Checklists to Understand Organizational Challenges and Opportunities around Fairness in AI, in: Proceedings of the 2020 CHI Conference on Human Factors in Computing Systems, CHI '20, ACM, New York, NY, USA, 2020, pp. 1–14. doi:10.1145/3313831.3376445.
[17] T. Kulesza, M. Burnett, W.-K. Wong, S. Stumpf, Principles of Explanatory Debugging to Personalize Interactive Machine Learning, in: Proceedings of the 20th International Conference on Intelligent User Interfaces, IUI '15, ACM, New York, NY, USA, 2015, pp. 126–137. doi:10.1145/2678025.2701399.
[18] V. Bellotti, K. Edwards, Intelligibility and Accountability: Human Considerations in Context-aware Systems, Human-Computer Interaction 16 (2001) 193–212. doi:10.1207/S15327051HCI16234_05.
[19] B. Y. Lim, A. K. Dey, Investigating Intelligibility for Uncertain Context-aware Applications, in: Proceedings of the 13th International Conference on Ubiquitous Computing, UbiComp '11, ACM, New York, NY, USA, 2011, pp. 415–424. doi:10.1145/2030112.2030168.
[20] B. Y. Lim, A. K. Dey, Toolkit to Support Intelligibility in Context-aware Applications, in: Proceedings of the 12th ACM International Conference on Ubiquitous Computing, UbiComp '10, ACM, New York, NY, USA, 2010, pp. 13–22. doi:10.1145/1864349.1864353.
[21] B. Y. Lim, A. K. Dey, D. Avrahami, Why and why not explanations improve the intelligibility of context-aware intelligent systems, ACM, Boston, MA, USA, 2009, pp. 2119–2128. doi:10.1145/1518701.1519023.
[22] M. K. Lee, D. Kusbit, A. Kahng, J. T. Kim, X. Yuan, A. Chan, D. See, R. Noothigattu, S. Lee, A. Psomas, A. D. Procaccia, WeBuildAI: Participatory framework for algorithmic governance, Proc. ACM Hum.-Comput. Interact. 3 (2019). doi:10.1145/3359283.
[23] C. Tsai, P. Brusilovsky, Designing Explanation Interfaces for Transparency and Beyond, in: Algorithmic Transparency in Emerging Technologies, Los Angeles, 2019. URL: http://ceur-ws.org/Vol-2327/.
[24] M. Ribera, A. Lapedriza, Can we do better explanations? A proposal of User-Centered Explainable AI, in: Explainable Smart Systems (ExSS), 2019, p. 7. URL: http://ceur-ws.org/Vol-2327/.
[25] G. Wiegand, M. Schmidmaier, T. Weber, Y. Liu, H. Hussmann, I Drive - You Trust: Explaining Driving Behavior of Autonomous Cars, in: Extended Abstracts of the 2019 CHI Conference on Human Factors in Computing Systems, CHI EA '19, ACM, New York, NY, USA, 2019, pp. LBW0163:1–LBW0163:6. doi:10.1145/3290607.3312817.
[26] E. B.-N. Sanders, P. J. Stappers, Co-creation and the new landscapes of design, CoDesign 4 (2008) 5–18. doi:10.1080/15710880701875068.
[27] A. Bourazeri, S. Stumpf, Co-designing Smart Home Technology with People with Dementia or Parkinson's Disease, in: Proceedings of the 10th Nordic Conference on Human-Computer Interaction, NordiCHI '18, ACM, New York, NY, USA, 2018, pp. 609–621. doi:10.1145/3240167.3240197.
[28] S. Wilson, A. Roper, J. Marshall, J. Galliers, N. Devane, T. Booth, C. Woolf, Codesign for people with aphasia through tangible design languages, CoDesign 11 (2015) 21–34. doi:10.1080/15710882.2014.997744.
[29] T. Neate, A. Bourazeri, A. Roper, S. Stumpf, S. Wilson, Co-Created Personas: Engaging and Empowering Users with Diverse Needs Within the Design Process, in: Proceedings of the 2019 CHI Conference on Human Factors in Computing Systems, CHI '19, ACM, New York, NY, USA, 2019, pp. 650:1–650:12. doi:10.1145/3290605.3300880.
[30] S. G. Hart, L. E. Staveland, Development of NASA-TLX (Task Load Index): Results of Empirical and Theoretical Research, in: P. A. Hancock, N. Meshkati (Eds.), Advances in Psychology, volume 52 (Human Mental Workload), North-Holland, 1988, pp. 139–183. URL: http://www.sciencedirect.com/science/article/pii/S0166411508623869.