Verification Staircase: a Design Strategy for Actionable Explanations

Martin Lindvall∗
Center for Medical Image Science and Visualization, Linköping University, Sweden
martin@ixd.ai

Jesper Molin
Sectra AB, Linköping, Sweden
jesper.molin+iui@gmail.com

∗ Also with Sectra AB.

ABSTRACT
What if the trust in the output of a predictive model could be acted upon in richer ways than a simple binary decision to accept or reject? Designing assistive AI tools for medical specialists entails supporting a complex but safety-critical decision process. Decisions in this domain can commonly be decomposed into a combination of many smaller decisions. In this paper, we present the verification staircase, a design strategy for such scenarios. In a verification staircase, multiple interactive assistive tools are combined to offer a nuanced amount of automation to the user. This can support a wide range of prediction-quality scenarios, spanning from unproblematic minor mistakes to misleading major failures. By presenting the information in a hierarchical way, the user is able to learn how underlying predictions are connected to overall case predictions and, over time, calibrate their trust so that they can choose the appropriate level of automatic support.

CCS CONCEPTS
• Human-centered computing → Human computer interaction (HCI); Interface design prototyping; User interface design; User centered design; • Computing methodologies → Machine learning; • Social and professional topics → Automation.

KEYWORDS
human-in-the-loop systems, human-ML collaboration, explanations, interaction design

ACM Reference Format:
Martin Lindvall and Jesper Molin. 2020. Verification Staircase: a Design Strategy for Actionable Explanations. In Proceedings of the IUI workshop on Explainable Smart Systems and Algorithmic Transparency in Emerging Technologies (ExSS-ATEC'20). Cagliari, Italy, 6 pages.

Use permitted under Creative Commons License Attribution 4.0 International (CC BY 4.0). ExSS-ATEC'20, March 17–20, 2020, Cagliari, Italy. © 2020 Copyright held by the owner/author(s).

1 INTRODUCTION
Machine learning (ML) techniques have potential impacts on clinical decision making in the field of digital pathology; however, a barrier is adapting experimental results into everyday clinical use. One issue is that while results in experimental settings show impressive overall performance, there is usually a relevant subset of cases where the model performs significantly worse than humans [6, 8]. Other issues such as dataset shift [7] and bias [13] also motivate a model of interaction with the predictive component positioned in the loop of human decision making.

In this paper, based on our experiences from a dual industrial-academic perspective, we outline a design strategy that we believe can be useful for resolving some of the challenges of designing human-ML collaborative systems.

The primary issue addressed by our proposed design strategy is enabling the user to answer questions such as:
• When do I trust the prediction enough to use automatic support, and when should I employ another diagnostic method?
• How can I justify my decision if a colleague asks?
• How can I feel safe in my conclusion?

In our suggested design strategy, multiple characteristics combine to enable answers to such questions, including in-the-loop correction, decomposition to allow explanations through causal inference, and designing to afford use both with high-performing predictions and with border-case accuracies.

Our insights come from ongoing human-centered design explorations. The presented perspective is rooted in our experience as UX practitioners within the field of digital pathology, with a strong emphasis on practical relevance. Typically, the goal of our design effort is to make systems where the resulting value is co-created between artifacts and humans in the context of use [1]. Thus we approach explainability pragmatically, starting from users' goals and needs. Our account is less concerned with taxonomy, such as distinguishing between explanations, justifications, interpretability and transparency [5], and more with our goal of creating systems that in the near future could aid clinicians in creating better patient outcomes.
The layout of this paper is as follows: first, we present and motivate the strategy of the verification staircase. Second, we illustrate the concept with an explorative design case for assisted quantification in digital pathology. Finally, we discuss our concept in the context of explainable intelligent user interfaces and outline our proposed continuation of the research.

2 FROM CLIFFS TO STAIRCASES
Consider a predictive model trained to assess whether a patient is eligible to receive some cancer-inhibiting drug. In the context of digital pathology, where tissues are viewed at high magnification, the result might be visualized in the context of the area of interest as depicted in Figure 1.

Figure 1: A diagnostic recommendation ("PD-L1 Positivity: 3%") by a predictive model, presented in the context of a digital pathology image.

For such an interaction, the user is supposed to look at the visualization and, if everything looks fine, accept the overall result. An appropriate strategy might be to trust and accept the result if the underlying accuracy is good enough for this particular case and reject it otherwise. If the user rejects the result, they will need to resort to performing the task manually. If the user interface affords no other means of judging the underlying accuracy than the manual approach, chances are that, unless there exist very strong guarantees that the model performs well on all possible cases, they will always reject the result and be forced to perform their manual method. We call this kind of human-ML interaction a verification cliff, as depicted in Figure 2.

Figure 2: In some human-ML interfaces the user must decide to either accept or reject the prediction based on their belief about the underlying accuracy.

What if there instead were multiple levels at which human-ML collaboration could be performed? Modes of human operation corresponding to nuanced levels of control have long been recognized as important factors for interaction with automation [10, 11]. We argue that the performance characteristics of many ML applications make them suitable for splitting collaboration into several levels, in a manner similar to the hierarchies of ecological interface design [14]. We will next illustrate this for our pathology scenario.

Many diagnostic tasks within pathology can be divided into multiple sub-tasks, e.g. an overall case-level score is derived from a formula combining the detection and classification of many individual cells. Consequently, it is possible to measure the accuracy per diagnostic case. When predictive algorithms are evaluated, it is common that an overall accuracy across cases, in the form of an AUC, F1-score or Cohen's kappa, is presented. However, in a scenario with case-level sub-tasks, we can also characterize the distribution of per-case accuracies over a large number of cases, see Figure 3.

Figure 3: Common distribution of per-case accuracies (per-case F1 score vs. number of cases) of a predictive model. There is a peak close to the average accuracy for the test set and then a long tail of cases.

The shape seen in the figure is typical and has been observed for many applications in our research. There is usually a peak in the distribution corresponding to the average accuracy and then a long tail of cases, with some cases almost always failing completely.
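As a concrete illustration of how such a distribution can be characterized, the sketch below computes a per-case F1 score from patch-level predictions and bins the scores as in Figure 3. This is an illustrative example added here, not code from the prototype; the data layout and function names are assumptions.

```python
# Illustrative sketch (not from the paper): characterizing the per-case
# accuracy distribution of a patch classifier. Assumes ground-truth and
# predicted patch labels are available, grouped by diagnostic case.
import numpy as np
from sklearn.metrics import f1_score

def per_case_f1(cases):
    """cases: dict mapping case_id -> (y_true, y_pred) arrays of patch labels."""
    return {case_id: f1_score(y_true, y_pred)
            for case_id, (y_true, y_pred) in cases.items()}

def accuracy_histogram(scores, bins=10):
    """Bin per-case scores over [0, 1], mirroring the histogram in Figure 3."""
    counts, edges = np.histogram(list(scores.values()), bins=bins, range=(0.0, 1.0))
    return counts, edges
```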
A better design strategy would be to think about how we can help our users when predictions fall within different intervals of the distribution. In many cases it is possible to divide the design into multiple interactions (a small sketch illustrating this split follows the list), such as:
(1) A good result visualization that can be used to quickly verify predictions in the 0.9-1.0 span
(2) A correction tool for small modifications of predictions, which updates the overall result, in the 0.7-0.9 span
(3) A semi-automatic aid not even based on the original predictions in the 0.4-0.7 span
etc.
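A minimal sketch of this split, assuming per-case scores such as those produced by per_case_f1 above; the interval boundaries follow the list, while the level names and the idea of counting validation cases per level are ours, purely for illustration.

```python
# Hypothetical sketch: mapping per-case scores onto the interaction levels
# listed above, e.g. to estimate how many validation cases each step of the
# staircase would need to serve. Interval boundaries follow the list above.
def staircase_level(score):
    if score >= 0.9:
        return "result visualization (verify only)"
    if score >= 0.7:
        return "correction tool (small modifications)"
    if score >= 0.4:
        return "semi-automatic aid (not based on original predictions)"
    return "manual review"

def level_workload(per_case_scores):
    """Count how many cases would fall on each step of the staircase."""
    counts = {}
    for score in per_case_scores.values():
        level = staircase_level(score)
        counts[level] = counts.get(level, 0) + 1
    return counts
```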
This way we could attempt to create multiple user interfaces aimed at helping the user when predictions happen to fall in different positions on the accuracy distribution. The decision of whether or not to trust the prediction would now be a question of degree: the placed trust could guide the choice of an interaction with an appropriate level of automatic support. The question then becomes: how would the user learn at which level to place their trust?

We suggest that requiring levels to be connected, correctable and composable, together with visualizations that make errors apparent, could be enough. In such a design, users should be able to dynamically move between interaction levels and perform corrections. Actions at one level should immediately be reflected in the others. We argue that this combination of actionable and composable levels will enable users to calibrate their trust over time, through learning to correlate top-level observations with the suitable amount of drill-down behavior. We call this strategy a verification staircase, as depicted in Figure 4.

Figure 4: In a verification staircase, multiple assistive interactions are combined in a way such that the user can move between them. While some levels mean more work and less support, they give more control and a better understanding of the predictions on lower-level phenomena. Corrections at lower levels affect higher levels, and vice versa. Each level is designed to allow the human-AI ensemble to be productive within an interval of (imperfect) prediction accuracy.

In the following part of this paper we will describe an ongoing case study where we have instantiated this design strategy in a tool that aids quantification in digital pathology.

3 DESIGNING WITH THE STAIRCASE: ASSISTED QUANTIFICATION

Method
We followed an iterative user-centered design (UCD) methodology combining sketching, high-fidelity (hi-fi) prototyping, data collection, model debugging, user observations and interviews. Pathologists and clinical experts were consulted throughout the process. Compared to traditional UCD, we used hi-fi prototyping earlier and more frequently. This is motivated by the difficulty of eliciting, through sketches and other low-fidelity methods, how the predictive output will be experienced and behave. Our account of the design process selectively highlights those insights we believe are important for appropriation and adaptation of the verification staircase concept to other domains.

Diagnostic task
The assisted quantification task targeted in our case study is to determine the ratio of two types of cells. Some cancers hide from the immune system by a kind of cloaking mechanism and can effectively be treated by disabling the cancerous cells' ability to do this. However, not all cancers hide by this mechanism. In order to determine whether a patient shall receive this expensive treatment, cells are stained such that the cell membrane of cells having the cloaking ability becomes brown. According to the diagnostic protocol, for treatment to be effective more than 50% of the cancerous cells in the tissue should have a stained membrane. If the tissue has more than 1% stained cells, the treatment might be effective. If stained cells are below 1%, the treatment will likely not work, and the patient should not be offered the treatment.
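The protocol thresholds above can be summarized as a simple decision rule. The sketch below only restates the logic as described in the text; it is not the authors' implementation and not clinical guidance, and the function name and handling of exact boundary values are assumptions.

```python
# Illustrative summary of the thresholds described above (>50%: effective;
# above 1%: might be effective; below 1%: should not be offered).
# Behavior at exactly 1% and 50% is an assumption, not specified in the text.
def treatment_recommendation(stained_ratio):
    """stained_ratio: fraction of cancerous cells with a stained membrane, 0.0-1.0."""
    if stained_ratio > 0.50:
        return "treatment expected to be effective"
    if stained_ratio > 0.01:
        return "treatment might be effective"
    return "treatment will likely not work; should not be offered"
```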
Thus, the diagnostic decision is based on estimating or counting this ratio in a possibly large tissue area. This task can be time-demanding and error-prone. Pathologists can use two basic strategies: they can look at the overall impression of the image and use their experience and tacit knowledge to "intuitively" determine the percentage right away. This is a very fast decision but can be error-prone. The second strategy involves manually counting tumor cells both with and without a stained membrane, and then deriving the ratio of the two. All things being equal, this second method will result in a more accurate decision but is orders of magnitude more time-demanding. As a middle ground, pathologists sometimes choose a much smaller area as a "representative sample", and only count within that area.

A machine learning-based predictive model has the potential to always use the second strategy, classifying at the cell level and reporting the exact ratio derived from the two counts.

Design process
We interviewed and observed the working processes of pathologists performing the task manually. We also reviewed the diagnostic protocols, where available. We collected and manually annotated cases and then trained a convolutional deep neural network to perform the predictions.

In one possible interaction, the user can delineate an area and receive the final result of the model as a percentage, as was depicted in Figure 1. This type of interaction is the verification cliff: the user has two options; either they accept the result blindly, or they reject it and perform their usual manual procedure. Based on the notion of a verification staircase, we sought another interaction where, if the user does not accept the top-most level of automation, they can step down to a lower level of automatic support that is still easier and faster than manual work.

We designed our first intermediate level for the case when most cells have received the correct classification, but a few need to be corrected for a satisfactory overall result. In the devised interface, the user can explore the top-level prediction by viewing and verifying a systematic subset of decisions of the underlying model, as depicted in Figure 5.

Figure 5: The UI for the first intermediate level focused on individual classifications. The predictions are patches sampled in a grid (top right) and can be interacted with either in a gallery of patches (left) or in the context of the tissue (right). The user is able to correct false classifications by clicking and dragging in the gallery, or by clicking a patch in the magnified main view (bottom, right).

At this level, the user is presented with a gallery of patches, sampled in a systematic spatial grid, where the patches are visually grouped according to whether they are considered to represent stained cells or not. For verification, the user can click a patch to review it at full magnification. The user can reclassify a patch via buttons in the magnified view or by drag-and-drop in the gallery. As soon as the user changes a patch, the final (top-level) ratio is updated (e.g. 31.4% [CI 30.0-32.8]).

We considered showing the decision of the model for each and every pixel (a "heatmap"), but this does not fulfill our criteria for the composability of levels: verifying and correcting every pixel-level decision would be unfeasible for most humans. In order not to create a barrier to the higher level of the summative cell ratio, we thus limit the output of the model to grid-sampled patches. The percentage is always reported with a calculated confidence interval, reflecting the uncertainty derived from only making decisions on a subset of the tissue's cells.

To support cases where the ratio is very close to a decision cut-off, the user can increase the certainty of their decision by adding patches, making the sampling grid denser.
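The paper does not state how the reported confidence interval is computed; the sketch below assumes a Wilson score interval over the grid-sampled patch decisions, simply to illustrate the bookkeeping: the ratio is reported together with an interval, and densifying the grid narrows the interval for the same underlying ratio. Function names and example counts are ours.

```python
# Assumed sketch: stained-cell ratio with a ~95% Wilson score interval over
# the grid-sampled patches. The actual interval method used in the prototype
# is not specified in the paper.
import math

def ratio_with_wilson_ci(n_positive, n_total, z=1.96):
    """Ratio and Wilson interval from counts of sampled patches."""
    p = n_positive / n_total
    denom = 1 + z**2 / n_total
    center = (p + z**2 / (2 * n_total)) / denom
    half = (z / denom) * math.sqrt(p * (1 - p) / n_total + z**2 / (4 * n_total**2))
    return p, (center - half, center + half)

# Densifying the grid narrows the interval for the same underlying ratio:
print(ratio_with_wilson_ci(157, 500))    # ratio 0.314, wider interval
print(ratio_with_wilson_ci(1256, 4000))  # ratio 0.314, narrower interval
```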
In evaluation with pathologists, we found that while this design was useful for a large subset of clinical cases where the diagnosis was far from a decision cut-off, there existed cases where the needed precision created a grid so dense that the amount of verification overwhelmed the user, and again they had to resort to the manual approach. Usually, not being able to reach the needed certainty for the case was only realized after extensive verification of many cell-level decisions.

We sought to remedy this by finding another intermediary level that had more automatic support than the one above, but less than only receiving a final percentage. To find opportunities for automatic support we analyzed the bias-based error in our underlying model. We found that most errors are somewhat systematic; visually similar patches might all be assigned the "wrong" classification. For instance, the threshold for brown staining intensity to be considered positive may differ between cases.

Based on this, we added an algorithm for unsupervised visual-similarity clustering to our system and sought to design the interaction such that the user could work by only making decisions at the cluster level. The user interface for this mode of interaction is depicted in Figure 6.

Figure 6: Patches with the same predicted label are grouped by clusters in the left-most panel. In this example, the cluster with four patches should be reclassified as non-tumor. An experienced pathologist is able to do this just by looking at the group of patches.

In this prototype, the user can choose to look at the resulting percentage (e.g. 72.6% [69-76], N=3601), to view the first few patches of each cluster, or to expand clusters and inspect their constituent patches. Additionally, clusters are ordered by uncertainty, and the patches within each cluster are also ordered by uncertainty. The intent is that the user hopefully can detect errors in only the first few clusters and then accept the rest.
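The paper does not name the clustering algorithm or the uncertainty measure. As one possible reading, the sketch below clusters patch embeddings with k-means and orders clusters and patches by predictive entropy; in the prototype this would be applied within each predicted-label group, as in Figure 6. All names and choices here are assumptions for illustration.

```python
# Hypothetical sketch of the cluster-level interaction: group visually similar
# patches and surface the most uncertain clusters (and patches) first. The
# paper does not specify the algorithm; k-means over CNN feature embeddings
# and predictive entropy are stand-ins.
import numpy as np
from sklearn.cluster import KMeans

def predictive_entropy(probs):
    """Per-patch entropy; probs has shape (n_patches, n_classes)."""
    eps = 1e-12
    return -np.sum(probs * np.log(probs + eps), axis=1)

def cluster_and_order(embeddings, probs, n_clusters=20):
    """Cluster patches (embeddings: n_patches x n_features) and order by uncertainty."""
    labels = KMeans(n_clusters=n_clusters, n_init=10).fit_predict(embeddings)
    uncertainty = predictive_entropy(probs)
    clusters = []
    for c in range(n_clusters):
        idx = np.where(labels == c)[0]
        idx = idx[np.argsort(-uncertainty[idx])]   # most uncertain patches first
        clusters.append((float(uncertainty[idx].mean()), idx))
    clusters.sort(key=lambda item: -item[0])       # most uncertain clusters first
    return [idx for _, idx in clusters]
```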
A typical, multi-level workflow when using this tool would be as follows:
(1) Open the case and initiate the use of the tool.
(2) (top level) Review the overall assigned percentage. Is it reasonable given the overall look of the tissue? If the confidence interval is far from a treatment cut-off, accept the result. Otherwise continue.
(3) (individual corrections) Is the grid dense? If not, start reviewing and correcting the patches of the top-most clusters. Observe the updated percentage and the confidence. Stop when you are making fewer corrections per cluster.
(4) (batch correction) If the grid is dense and there are over 500 patches, start reviewing the top-most clusters; based on its first patch, does the cluster have the correct classification? If not, correct the classification for the entire cluster. Observe the updated percentage and the confidence.
(5) (individual correction) Check the patches in the cluster; does any patch "stand out" as not belonging to the cluster? Correct such patches by dragging them to the correct category; they will automatically be assigned to another cluster of that type.
(6) (batch correction) Proceed through a few clusters; once no or few errors are detected, the rest is probably correct.

Evaluation
We presented this multi-level version of the tool to three pathologists who had not been part of the design process, in a small qualitative assessment. The three pathologists were presented the tool for the first time. We wanted to know whether the prototype could be clinically useful and, more specifically, whether the pathologists could learn multi-level strategies that allowed them to balance detailed control, time spent and diagnostic quality. Our goal was primarily to assess the concept's viability for further empirical efforts.

We found a recurring theme of initially wanting to drill down to cell level. Pathologists reported that they would need some "alone time" to learn what kind of systematic errors the prediction was making, and to correlate this to the overall appearance of the case. When asked whether they thought they would be able to learn when to work at which level of detail, they were tentatively positive, but stated that time would tell for certain.

To us, it seemed the design had potential in allowing them to work with sometimes inaccurate models, and also, by moving between levels through drill-down, we hope that they might learn to calibrate their trust towards working at the right level as appropriate. It could be that a more global, model-level understanding can be achieved by interacting with local justifications like ours over time. By contrast, user interfaces where human-ML collaboration becomes a verification cliff do not as readily afford this, as the manual approach and the assisted one are completely disjunct.

While the results from such a small user study are mostly anecdotal at this point, we are planning to evaluate this aspect more extensively in future research.

4 DISCUSSION
While our concept of verification staircases is early work, we believe it has connections to many of the same issues that research on explainable and transparent intelligent tools seeks to address.

For instance, many of the principles outlined for Explanatory Debugging [3] are imbued in our concept, such as being iterative, being sound and complete, not overwhelming, and being actionable. The major difference is that our proposed explanations do not correlate predictions to the inner workings of the model, but instead to the underlying phenomena viewed at different fidelities.
The need for enabling user feedback for explanations [12] is facilitated by excluding references to the inner workings of the model, letting the images of the domain problem always act as the shared language that creates common ground for communication. It is noteworthy that this interaction affords continuous learning of the machine learning component by enabling the corrections to become training data for future iterations [4].

Enabling global model understanding through repeated exposure to local justifications is similar to the strategy employed by the LIME technique [9].

Our current design aids the user in detecting errors, e.g., by sorting patches and clusters on confidence. We then rely on the user being able to learn which end of the model's accuracy distribution they are in, or at least the suitable amount of validation effort to spend. There exist other approaches to facilitating error detection and determining the accuracy of classifiers [2] that could be interesting to incorporate in future versions.

A limitation of our current prototype is that a user's correction of single patches or clusters affects only the directly involved patches, clusters and the overall ratio. We have experimented with versions where the model is fine-tuned using this input and the predictive output is updated, in an interactive machine learning manner. However, this kind of global update creates a lack of control for which we have yet to find good interaction design solutions that suit our safety-critical domain. We believe this is an interesting area for future research.

ACKNOWLEDGMENTS
This work was partially supported by the Wallenberg AI, Autonomous Systems and Software Program (WASP).

REFERENCES
[1] Gilbert Cockton. 2006. Designing Worth is Worth Designing. In Proceedings of the 4th Nordic Conference on Human-computer Interaction: Changing Roles (NordiCHI '06). ACM, New York, NY, USA, 165–174. https://doi.org/10.1145/1182475.1182493
[2] Alex Groce, Todd Kulesza, Chaoqiang Zhang, Shalini Shamasunder, Margaret Burnett, Weng-Keen Wong, Simone Stumpf, Shubhomoy Das, Amber Shinsel, Forrest Bice, and Kevin McIntosh. 2014. You Are the Only Possible Oracle: Effective Test Selection for End Users of Interactive Machine Learning Systems. IEEE Transactions on Software Engineering 40, 3 (March 2014), 307–323. https://doi.org/10.1109/TSE.2013.59
[3] Todd Kulesza, Margaret Burnett, Weng-Keen Wong, and Simone Stumpf. 2015. Principles of Explanatory Debugging to Personalize Interactive Machine Learning. In Proceedings of the 20th International Conference on Intelligent User Interfaces (IUI '15). ACM, New York, NY, USA, 126–137. https://doi.org/10.1145/2678025.2701399
[4] Martin Lindvall, Jesper Molin, and Jonas Löwgren. 2018. From Machine Learning to Machine Teaching: The Importance of UX. Interactions 25, 6 (Oct. 2018), 52–57. https://doi.org/10.1145/3282860
[5] Shane T. Mueller, Robert R. Hoffman, William Clancey, Abigail Emrey, and Gary Klein. 2019. Explanation in Human-AI Systems: A Literature Meta-Review, Synopsis of Key Ideas and Publications, and Bibliography for Explainable AI. (Feb. 2019). https://arxiv.org/abs/1902.01876v1
[6] Luke Oakden-Rayner, Jared Dunnmon, Gustavo Carneiro, and Christopher Ré. 2019. Hidden Stratification Causes Clinically Meaningful Failures in Machine Learning for Medical Imaging. arXiv:1909.12475 [cs, stat] (Nov. 2019). http://arxiv.org/abs/1909.12475
[7] Joaquin Quiñonero-Candela (Ed.). 2009. Dataset Shift in Machine Learning. MIT Press, Cambridge, Mass.
[8] Maithra Raghu, Katy Blumer, Greg Corrado, Jon Kleinberg, Ziad Obermeyer, and Sendhil Mullainathan. 2019. The Algorithmic Automation Problem: Prediction, Triage, and Human Effort. arXiv:1903.12220 [cs] (March 2019). http://arxiv.org/abs/1903.12220
[9] Marco Tulio Ribeiro, Sameer Singh, and Carlos Guestrin. 2016. "Why Should I Trust You?": Explaining the Predictions of Any Classifier. In Proceedings of the 22nd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining (KDD '16). ACM, New York, NY, USA, 1135–1144. https://doi.org/10.1145/2939672.2939778
[10] Thomas B. Sheridan. 2018. Comments on "Issues in Human–Automation Interaction Modeling: Presumptive Aspects of Frameworks of Types and Levels of Automation" by David B. Kaber. Journal of Cognitive Engineering and Decision Making 12, 1 (March 2018), 25–28. https://doi.org/10.1177/1555343417724964
[11] Thomas B. Sheridan and William L. Verplank. 1978. Human and Computer Control of Undersea Teleoperators. https://doi.org/10.21236/ada057655
[12] Alison Smith and James J. Nolan. 2018. The Problem of Explanations without User Feedback. Position paper presented at the IUI'18 Workshop on Explainable Smart Systems.
[13] Antonio Torralba and Alexei A. Efros. 2011. Unbiased Look at Dataset Bias. In CVPR 2011. 1521–1528. https://doi.org/10.1109/CVPR.2011.5995347
[14] K. J. Vicente and J. Rasmussen. 1992. Ecological Interface Design: Theoretical Foundations. IEEE Transactions on Systems, Man, and Cybernetics 22, 4 (July 1992), 589–606. https://doi.org/10.1109/21.156574