=Paper=
{{Paper
|id=Vol-2846/paper15
|storemode=property
|title=Team Design Patterns for Moral Decisions in Hybrid Intelligent Systems: A Case Study of Bias Mitigation
|pdfUrl=https://ceur-ws.org/Vol-2846/paper15.pdf
|volume=Vol-2846
|authors=Jip J. van Stijn,Mark A. Neerincx,Annette ten Teije,Steven Vethman
|dblpUrl=https://dblp.org/rec/conf/aaaiss/StijnNTV21
}}
==Team Design Patterns for Moral Decisions in Hybrid Intelligent Systems: A Case Study of Bias Mitigation==
Jip J. van Stijn (a), Mark A. Neerincx (b,c), Annette ten Teije (a) and Steven Vethman (c)

(a) Vrije Universiteit Amsterdam, De Boelelaan 1105, 1081 HV Amsterdam, The Netherlands
(b) Technische Universiteit Delft, Mekelweg 5, 2628 CD Delft, The Netherlands
(c) Netherlands Organisation for Applied Scientific Research (TNO), Kampweg 55, 3769 DE Soesterberg, The Netherlands

Abstract

Increasing automation in the healthcare sector calls for a Hybrid Intelligence (HI) approach to closely study and design the collaboration of humans and autonomous machines. Ensuring that medical HI systems' decision-making is ethical is key. The use of Team Design Patterns (TDPs) can advance this goal by describing successful and reusable configurations of design problems in which decisions have a moral component, and by facilitating communication in multidisciplinary teams designing HI systems. For this research, TDPs were developed describing a set of solutions for a design problem in a medical HI system: mitigating harmful biases in machine learning algorithms. The Socio-Cognitive Engineering (SCE) methodology was employed, integrating operational demands, human factors knowledge, and a technological analysis into a set of TDPs. A survey was created to assess the usability of the patterns with regard to their understandability, effectiveness, and generalizability. Results showed that TDPs are a useful method to unambiguously describe solutions for diverse HI design problems with a moral component on varying abstraction levels, usable by a heterogeneous group of multidisciplinary researchers. Additionally, results indicated that the SCE approach and the developed questionnaire are suitable methods for creating and assessing TDPs.

Keywords: Hybrid intelligence, socio-cognitive engineering, value-sensitive design, bias mitigation, team design patterns, moral decision-making

1. Introduction

Over the past decades, the healthcare domain has witnessed a steep increase in automation. eHealth applications allow for higher quality and more cost-effective care [1], robot-assisted surgery has proven to be effective and safe in several medical domains [2], and machine learning algorithms are capable of classifying malignancies in radiology images better than many radiologists [3]. While automation has unprecedented potential, concerns have been voiced in academia and society, pointing at undesirable effects of recent autonomous systems (e.g. [4]). It is of utmost importance to design such systems carefully to make sure that their actions align with human goals and values.
The moral component of autonomous machines has been the subject of recent research, especially in the field of machine ethics. This discipline attempts to contribute to the creation of Artificial Moral Agents (AMAs) that follow certain ethical rules [5]. An important current challenge is posed by the discipline of Hybrid Intelligence (HI), aimed at utilizing the complementary strengths of human and artificial intelligence, so that they can perform better than either of the two separately [6, 7]. Using a HI approach, it is key to not only include machine requirements in the study of AMAs, but also to look into the cognitive capabilities of the humans they perform teamwork with.

As this is a highly multidisciplinary endeavor, it is first and foremost imperative to establish a common language to talk about moral situations in HI systems. A promising format for this is the use of Team Design Patterns (TDPs): combinations of text and pictorial language to describe possible solutions to recurring design problems [8]. These patterns represent reusable and generic HI design solutions in a coherent way and are aimed at facilitating the multidisciplinary HI design process. Van Diggelen and Johnson [9] developed a simple and intuitive graphical TDP language, expressing different types of work, different degrees of engagement, and different environmental constraints. Van der Waa et al. [10] made a first attempt to apply this language to the moral domain, describing the task allocation of human-computer teams in morally sensitive situations. However, the TDP language offers only minimal means for expressing human cognitive components and requirements, which are a vital element of a truly HI approach toward moral decision-making in autonomous systems. Additionally, the TDP method has so far solely been used as a taxonomy and has not been utilized in the design process of HI, making it difficult to evaluate its effects.

1.1. Research aims

The aim of the current research is twofold. Firstly, it attempts to advance the conceptualization of moral decision-making in HI systems in the medical domain by creating Team Design Patterns for the process of bias mitigation in a HI digital assistant for diabetes type II care. These patterns should serve as reusable entities for solving similar design challenges. This paper contributes to the existing TDP literature by addressing two issues: (1) the expression of both human cognitive components and AI requirements for moral decision-making and (2) the development of TDPs not only as a library of successful and reusable design solutions, but also their application as a method in the HI design process by a multidisciplinary team. In this paper, we study the potential of TDPs as an empirical approach for involving expert knowledge from a wide range of disciplines early in the design process of HI systems.

Secondly, this research aims to contribute to a scientific standard in the methodology of conceptualizing moral decision-making in HI systems, as there is currently no such standard. Documentation of the methods used for arranging and evaluating the TDPs may serve as a benchmark in the development of a universal methodological framework for the conceptualization of moral decision-making in human-computer teams. The two aims mentioned above resulted in two main research questions:
1. How can Team Design Patterns describe moral decision-making for bias mitigation in a medical HI system, so that they are usable by researchers from the various disciplines involved in the design of such systems?
2. Which methodological tools are suitable for the creation of Team Design Patterns?

1.2. Methodology and Research Design

Though important for the universal adoption of the TDP language across disciplines, a standardized set of methods for the creation of Team Design Patterns is still missing. Such a methodology has several requirements. It should (1) be geared to HI by combining technical AI knowledge with human factors knowledge, (2) be able to incrementally improve the patterns based on their application to new use cases, and (3) be able to incorporate the moral values of the stakeholders involved in the patterns' application domain. Socio-Cognitive Engineering (SCE), which meets these requirements, was developed for the design of hybrid intelligent systems, combining elements from cognitive engineering, user-centered design, and requirements analysis [11]. It has been implemented in a wide range of systems in various domains, including a digital support assistant for children with diabetes [12]. An overview of SCE is illustrated in Figure 1.

Figure 1: SCE overview.

The SCE methodology is always applied to satisfy a specific demand in a particular context. The foundation layer consists of three components. Firstly, it includes an analysis of operational demands, which revolves around inspecting the work domain of the hybrid intelligent system and the support that is needed. Additionally, it includes an analysis of human factors relevant to the system, as well as an analysis of the technological principles that may be appropriate for the envisioned support. In the specification component of SCE, a number of objectives of the envisioned system are defined. This leads to the recognition of functions of the system, which are contextualized by scenario-like descriptions of the intended human-machine interactions. The functions are supposed to bring about certain effects, which are called claims. Lastly, the evaluation component uses a prototype or simulation to test whether the specified functions really have the claimed effects. The results of the evaluation can then be utilized to revise and enhance the foundation, specification and evaluation components, incrementally advancing the product. The current research was designed to follow the SCE methodology as presented in [13]. In this adaptation of the methodology, elements of value-sensitive design were included in the foundation and specification layers in order to take the stakeholders' moral values into account throughout the design process.

2. Foundation

2.1. Operational Demands

The current research employs SCE for the development of a system that aids with Diabetes Type II (DT2) care [14]. The operational demands analysis of the SCE methodology revolves around the question: what kind of support is needed in the application domain? [11] The envisioned DT2 system is to provide support to healthcare professionals and patients in the prevention, diagnosis, treatment, and management of the disease through models based on existing patient data and domain knowledge. These models may, for example, be geared towards predicting a patient's risk of developing diabetes or suggesting a diagnosis. Alternatively, they may predict the best type and dose of medicine or type of lifestyle change to make the disease as unintrusive as possible. Through several modules, these predictions or suggestions are presented to the relevant patient or healthcare professional. Finally, the models are improved and updated with patients' medical and behavioral data.
Our focus is on the support of moral decision-making; the next step is therefore to localize processes in this system in which actors face choices that have a moral component. This analysis was based on medical guidelines, first drafts of the envisioned system, and interviews with four experts in the domain of lifestyle-related diseases and their care. We identified three design challenges in which moral decision-making plays a large role: (1) the mitigation of harmful biases in learning algorithms, (2) the sharing of patients' medical and behavioral data and the recording of consents, and (3) suggestions and interventions into the patients' lifestyle. In this paper we discuss only the first.

Bias mitigation is necessary when learning models develop biases that may result in unfair treatment by the system. For example, the underrepresentation of certain ethnic groups in the input data can result in a racial bias in the system. People of those ethnic backgrounds may then receive worse care than others. The system's developers can employ techniques to mitigate the harmful bias (see section 2.2), but this usually results in a lower average accuracy of the predictions [15]. Hence, the value tension in this moral issue is between the system's (average) effectiveness and fair treatment of each patient.

2.2. Analysis of Technological Principles

To incorporate bias mitigation in HI team design patterns, an understanding of bias mitigation in AI systems is required. Fair machine learning is a young and rapidly developing field that faces the challenge of formalizing the definition of fairness and the related bias [16]. Two types of formulations for fairness are widely used: the statistically defined group fairness and the more locally defined individual fairness [16, 17]. Group fairness measures assess whether subjects in a demographic group are classified similarly, according to statistical measures, relative to other groups in the sample or population. Measures often relate to statistical parity, which requires the chance of receiving a false positive or false negative to be independent of certain sensitive features such as ethnicity, gender or sexuality [18]. Individual fairness, in contrast, entails that people with similar traits with respect to a certain task should be treated similarly. This is usually measured by a context-specific distance metric [18]. These two notions are often at odds with each other, illustrating that there is no firm consensus on a universal approach to quantifying fairness [19].

Although there is no consensus on the formalization of fairness, a substantial amount of research has focused on several measure- and context-specific solutions to tackle bias and unfairness. In a comprehensive review of the currently available methods for reducing unfairness in machine learning algorithms, [20] identify three types of methods based on where the bias mitigation is performed: pre-processing, in-processing, or post-processing. Pre-processing methods focus on bias in the training data, e.g. [15], while in-processing methods aim to modify the algorithm to reduce unfair prediction in the learning phase, e.g. [21]. Post-processing methods make adjustments after learning to satisfy fairness constraints, e.g. [22].
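To make the group-fairness notions above concrete, the sketch below computes a statistical parity difference and a per-group accuracy gap for a binary classifier with a binary sensitive attribute. It is our own minimal illustration, not part of the paper or the envisioned system; the function names and toy data are ours.

<pre>
# Illustrative sketch (not from the paper): two common group-fairness measures
# for a binary classifier with a binary sensitive attribute. Names and data are ours.
import numpy as np

def statistical_parity_difference(y_pred, sensitive):
    """Difference in positive-prediction rates between the two groups."""
    rate_a = y_pred[sensitive == 0].mean()
    rate_b = y_pred[sensitive == 1].mean()
    return rate_a - rate_b

def accuracy_gap(y_true, y_pred, sensitive):
    """Difference in prediction accuracy between the two groups."""
    acc_a = (y_true[sensitive == 0] == y_pred[sensitive == 0]).mean()
    acc_b = (y_true[sensitive == 1] == y_pred[sensitive == 1]).mean()
    return acc_a - acc_b

# Toy example: predictions for 8 patients, 4 per group.
y_true    = np.array([1, 0, 1, 0, 1, 0, 1, 0])
y_pred    = np.array([1, 0, 1, 0, 0, 0, 1, 1])
sensitive = np.array([0, 0, 0, 0, 1, 1, 1, 1])

print(statistical_parity_difference(y_pred, sensitive))  # 0.0 -> parity holds
print(accuracy_gap(y_true, y_pred, sensitive))           # 0.5 -> group 0 served better
</pre>

A non-zero accuracy gap of the kind shown here is exactly the signal that triggers the supervision steps in the patterns of Section 3.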
Regardless of the exact formalization and method chosen, fairness is usually an objective that exists alongside a conventional utility measure, such as accuracy. These competing objectives raise a moral question: to what extent may the optimization of average effectiveness diverge from the fairness-unaware solution in order to comply with the additional fairness constraint? Independent of a specific measure or method, we therefore focus on incorporating this important aspect of fair machine learning and bias mitigation: the trade-off between fairness measures and overall utility [23, 24]. This provides the TDP with the flexibility and versatility necessary for the context-dependent nature of bias and for new developments in the field of fair machine learning.
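The following sketch illustrates this trade-off under assumptions of our own (synthetic scores and a simple post-processing-style choice of group-specific thresholds); it is not the paper's method, and any concrete mitigation technique from Section 2.2 could take its place. Sweeping the weight given to the parity gap shows how average accuracy is gradually traded for smaller group differences.

<pre>
# Illustrative sketch (assumptions ours): tracing the accuracy-fairness trade-off
# by sweeping a fairness penalty. For each weight we pick the group-specific
# decision thresholds that maximise  accuracy - weight * |parity gap|.
import numpy as np

rng = np.random.default_rng(0)
n = 2000
group = rng.integers(0, 2, n)                  # sensitive attribute (0 or 1)
score = rng.beta(2 + group, 2, n)              # model scores, shifted per group
y_true = rng.binomial(1, score)                # labels consistent with the scores

def evaluate(thresholds):
    y_pred = (score >= thresholds[group]).astype(int)
    acc = (y_pred == y_true).mean()
    gap = abs(y_pred[group == 0].mean() - y_pred[group == 1].mean())
    return acc, gap

grid = np.linspace(0.05, 0.95, 19)
for weight in (0.0, 0.5, 2.0):                 # 0.0 = fairness-unaware optimum
    best = max(((evaluate(np.array([t0, t1])), (t0, t1))
                for t0 in grid for t1 in grid),
               key=lambda item: item[0][0] - weight * item[0][1])
    (acc, gap), (t0, t1) = best
    print(f"weight={weight:.1f}  thresholds=({t0:.2f}, {t1:.2f})  "
          f"accuracy={acc:.3f}  parity_gap={gap:.3f}")
</pre>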
2.3. Human Factors Analysis

McLaughlin et al. [25] identify five main cognitive processes: attention, memory, perception, decision-making, and knowledge aids. Each of these classes consists of several more specific processes. For example, attention aids can support humans in selective, orienting, sustained, or divided attention. Moral decision-making is a complex mental and social process for which philosophers and scientists do not have a univocal explanation. It is apparent, however, that the process requires all five categories specified by [25]. We aim to use these five main cognitive processes to conceptualize technological requirements in the Team Design Pattern language.

3. Specification

In the specification phase of SCE, the foundational knowledge was brought together into TDPs describing possible solutions for the bias mitigation design problem. The operational demands and technology analyses were used to create the patterns' team process configurations. The (dis)advantages of each pattern address the value tensions identified in the analysis of operational demands. The taxonomy of cognitive aids [25] was used for the patterns' human requirements, and the hybrid AI boxology [26] was used to identify AI requirements. Table 1 shows a simple and common TDP to solve the bias mitigation design challenge, while Table 2 depicts a solution which requires more advanced technology. While addressing this challenge, it became clear that other design challenges could be nested within these patterns. For example, in Table 2, the human and machine agent have a joint responsibility to change the model, which presents a design challenge of its own. We designed several sub-patterns to address this sub-challenge, one of which is shown in Table 3.

Table 1. Pattern 1: The Human Moral Decision-Maker

Name: Human Moral Decision-Maker
Description: In this pattern, the machine agent solely performs a machine learning task: predicting the diagnosis (e.g. diabetes) of patients. The human AI developer supervises this process, measuring both the overall accuracy and the fairness of the predictions. If the human agent thinks that the balance between these two measures is off (e.g. because people with a specific ethnic background receive significantly less accurate diabetes diagnoses), the human initiates a takeover. In this takeover, the machine stops its task, while the human changes the model.
Structure:
1. Machine uses learning models to make predictions about diagnosis.
2. Human performs task supervision: are the predictions accurate?
3. Human performs moral supervision: are the predictions accurate for everybody? Is there social discrimination based on subgroups?
4. If the balance between overall accuracy and fairness of the model is off, the human initiates a takeover, deciding whether and how to change the model.
Human req.:
(a) Sufficient working memory for task supervision and moral supervision
(b) Sufficient moral attention to recognize morally sensitive situations
(c) Sufficient moral knowledge and domain knowledge to make the moral decision
AI req.:
(d) Machine learning
Advantages:
+ Human is accountable for recognizing and making the moral decision
+ Machine does not require moral competencies
Disadvantages:
- Cognitive under- or overload of the human may result in missing moral choice situations or optional solutions
- Human may be turned into a moral scapegoat

Table 2. Pattern 2: The Coactive Moral Decision-Maker

Name: The Coactive Moral Decision-Maker
Description: In this pattern, the machine has more moral responsibilities. The machine performs moral supervision on itself: it measures whether its own predictions are equally accurate for each subgroup (e.g. whether diabetes diagnoses are equally accurate for patients from different ethnic backgrounds). The human is on stand-by. If the machine measures bias in its own predictions, it initiates a handover. The machine explains which exceeded thresholds necessitate a moral decision (e.g. because the model is racially biased towards people with a specific ethnic background). The human and the computer then make a joint decision in changing the model.
Structure:
1. Human is on stand-by.
2. Machine uses learning models to make predictions about diagnosis.
3. Machine performs task supervision: are the predictions accurate?
4. Machine performs moral supervision: are the predictions accurate for everybody? Is there social discrimination based on subgroups?
5. If the balance between overall accuracy and fairness of the model is outside preset human-made boundaries, the machine initiates a handover.
6. Machine explains the moral context: which preset thresholds are exceeded?
7. Human and machine jointly decide whether and how to change the model.
Human req.:
(a) Sufficient trust in the machine to recognize morally sensitive situations
(b) Sufficient understanding of moral implications
(c) Sufficient moral knowledge and domain knowledge to make the moral decision
AI req.:
(d) Machine learning
(e) Ability to recognize morally sensitive situations
(f) Ability to sufficiently explain the moral context
(g) Moral decision support
Advantages:
+ Human is on stand-by, allowing them to do different tasks
+ Human is accountable for moral consequences, but can receive support
Disadvantages:
- Morally sensitive situations may not be recognized by the machine
- Human may be biased by the machine's explanations and suggestions

Table 3. Pattern 2.1: The Suggesting Machine

Name: The Suggesting Machine
Description: In this pattern, the machine has more moral responsibilities, while the human only has a reviewing role. In the first frame, not the human but the machine decides whether a model change is desirable, based on preset conditions (e.g. if the learning model's diabetes diagnoses are over 10% less accurate for people with a specific ethnic background). After this, the machine simulates all possible methods to change the model, and their effects on the accuracy-fairness trade-off. The machine then suggests the optimal method, which the human reviews. The human agent takes this into consideration, and finally picks the preferred method.
Structure:
1. Machine considers whether a model change is necessary, based on preset rules.
2. If so, it initiates a transition. The machine simulates all possible methods to mitigate bias.
3. Machine provides decision support: it gives a suggestion of the optimal method.
4. Human reviews this suggestion.
5. Human picks the method to change the model.
Human req.:
(a) Sufficient trust in the machine agent's suggestions
AI req.:
(b) Ability to recognize and take responsibility for moral choices
(c) Capability to simulate the effects of all possible options
(d) Sufficient moral understanding for picking the optimal choice
Advantages:
+ Low cognitive demands for the human
+ Clearly defined boundaries for which situations demand a moral response
Disadvantages:
- High demands on the machine's computing power
- Risk of the machine missing morally sensitive situations if they are not included in the preset boundaries
- Human overtrust in the machine may result in little moral deliberation
- Machine only considers a premade set of moral responses, and cannot think 'outside the box'
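As a purely illustrative sketch (ours, not the authors' implementation), the supervision-and-handover logic of Pattern 2 could be operationalized along the following lines, with the preset, human-made boundaries expressed as explicit thresholds and the exceeded thresholds returned as the machine's explanation of the moral context.

<pre>
# Minimal sketch (our own illustration) of the moral-supervision and handover
# steps of Pattern 2: the machine checks its own accuracy and group fairness
# against preset, human-made boundaries and, when a boundary is exceeded,
# hands over with an explanation of which thresholds were violated.
from dataclasses import dataclass

@dataclass
class MoralBoundaries:
    min_accuracy: float = 0.80      # preset by the human team
    max_accuracy_gap: float = 0.05  # max tolerated accuracy gap between subgroups

def moral_supervision(overall_acc, group_accs, bounds):
    """Return a list of exceeded thresholds; an empty list means no handover."""
    exceeded = []
    if overall_acc < bounds.min_accuracy:
        exceeded.append(f"overall accuracy {overall_acc:.2f} < {bounds.min_accuracy}")
    gap = max(group_accs.values()) - min(group_accs.values())
    if gap > bounds.max_accuracy_gap:
        worst = min(group_accs, key=group_accs.get)
        exceeded.append(f"accuracy gap {gap:.2f} > {bounds.max_accuracy_gap} "
                        f"(least accurate subgroup: {worst})")
    return exceeded

bounds = MoralBoundaries()
exceeded = moral_supervision(0.86, {"group A": 0.91, "group B": 0.79}, bounds)
if exceeded:                        # steps 5-6: handover with moral context
    print("Handover initiated. Moral context:")
    for reason in exceeded:
        print(" -", reason)
    # Step 7: the human and machine jointly decide whether and how to change
    # the model, e.g. via a sub-pattern such as Pattern 2.1.
</pre>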
4. Evaluation

4.1. Data collection and analysis

A usability evaluation was performed with four metrics: (1) understandability, (2) coherency, (3) effectiveness, and (4) generalizability. A questionnaire was filled in by the patterns' prospective direct users: researchers and designers of HI systems. Thirty researchers and developers from various research areas at TNO were approached through a network sample. Twenty of them responded, taking around 45 minutes to fill in the questionnaire. The first section consisted of five-point Likert scale questions inquiring about the participants' background knowledge in several disciplines relevant to moral decision-making in HI systems. After that, a video was presented explaining the basic elements of TDPs for moral decision-making. The next section of the questionnaire presented the bias mitigation design challenge as described in section 2.1, including its relevance, the actors involved, and the moral tension between beneficence and fairness it entails. It then presented the proposed patterns, each with four Likert scale statements addressing the metrics mentioned above. Each of the statements was followed by the prompt 'Please explain your answer' and a long-answer text field, resulting in a combination of quantitative and qualitative data for each pattern. Additionally, participants were asked which tasks or concepts were missing from or should be added to the patterns.

For the quantitative part of the analysis, the Likert scale questions regarding background knowledge and the ratings regarding the metrics (understandability, coherency, effectiveness, and generalizability) of each of the patterns were analyzed. Due to the nonparametric nature of Likert scale data and the small sample size (N=20), the mode and median were used as indications of the distribution of the responses. For the same reason, Spearman's rank correlation test was performed to test for correlations between variables, while Wilcoxon's rank-sum test was used to test for statistically significant differences between ratings for the patterns. The qualitative analysis aimed to gain insight into the metrics described above. Additionally, it aimed to reveal concepts and requirements that are still missing from the patterns from the perspective of their anticipated users. Hence, the qualitative data was analyzed thematically and largely data-driven. The four metrics were used as predetermined themes, within which sub-themes were inferred by categorizing the responses on an increasingly abstract level.
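For readers who wish to reproduce this kind of analysis, the sketch below runs the named tests on fabricated placeholder ratings; the data are invented, and only the choice of tests mirrors the description above.

<pre>
# Hypothetical sketch of the quantitative analysis described in Section 4.1.
# The ratings below are made up; only the tests mirror those named in the paper.
import numpy as np
from scipy.stats import spearmanr, ranksums

# Five-point Likert ratings of understandability for two patterns (N = 20).
rng = np.random.default_rng(1)
pattern1 = rng.integers(3, 6, 20)       # fabricated placeholder ratings
pattern2 = rng.integers(2, 6, 20)
ai_background = rng.integers(1, 6, 20)  # fabricated self-reported background

# Mode and median as distribution indicators for nonparametric Likert data.
print("pattern 1 median:", np.median(pattern1),
      "mode:", np.bincount(pattern1).argmax())

# Spearman's rank correlation: background knowledge vs. understandability rating.
rho, p = spearmanr(ai_background, pattern1)
print(f"Spearman rho={rho:.2f}, p={p:.3f}")

# Wilcoxon rank-sum test: do the two patterns receive different ratings?
stat, p = ranksums(pattern1, pattern2)
print(f"rank-sum statistic={stat:.2f}, p={p:.3f}")
</pre>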
4.2. Results

Distributions of the participants' self-reported background knowledge in the prespecified disciplines showed that (self-reported) experts from all specified research disciplines were present in the participant sample. This suggests that the sample was a good representation of the target user group: researchers and designers of hybrid intelligent systems with varying knowledge of the related disciplines. Overall, the majority of respondents was positive regarding the understandability of all patterns. Pattern 2, with a mode of 4, was rated less understandable than pattern 1 (p=0.055) and pattern 2.1 (p=0.011), possibly due to its higher complexity. The fact that there were no correlations between self-reported background knowledge and understandability suggests that the patterns were understandable for researchers and designers regardless of their area of expertise, which is a key requirement for their purpose of facilitating communication between disciplines. However, the qualitative data indicated that more specific examples would benefit the understanding of some of the participants with little background in AI. Participants were largely positive regarding the coherency and generalizability of the patterns (with modes of 4 and 5). Ratings for both measures were stable regardless of the specific pattern, with high internal correlations (0.50