               Verification Staircase: a Design Strategy for
                         Actionable Explanations
                       Martin Lindvall∗                                                      Jesper Molin
       Center for Medical Image Science and Visualization                                      Sectra AB
                  Linköping University, Sweden                                             Linköping, Sweden
                         martin@ixd.ai                                                jesper.molin+iui@gmail.com

ABSTRACT
What if the trust in the output of a predictive model could be acted upon in richer
ways than a simple binary decision of accept or reject? Designing assistive AI tools
for medical specialists entails supporting a complex but safety-critical decision
process. Decisions in this domain can commonly be decomposed into a combination of
many smaller decisions. In this paper, we present Verification Staircase, a design
strategy that can be used for such scenarios. In a verification staircase, multiple
interactive assistive tools are combined to allow for a nuanced amount of automation
to aid the user. This can support a wide range of prediction quality scenarios,
spanning from unproblematic minor mistakes to misleading major failures. By presenting
the information in a hierarchical way, the user is able to learn how underlying
predictions are connected to overall case predictions and, over time, calibrate their
trust so that they can choose the appropriate level of automatic support.

CCS CONCEPTS
• Human-centered computing → Human computer interaction (HCI); Interface design
prototyping; User interface design; User centered design; • Computing methodologies →
Machine learning; • Social and professional topics → Automation.

KEYWORDS
human-in-the-loop systems, human-ML collaboration, explanations, interaction design

ACM Reference Format:
Martin Lindvall and Jesper Molin. 2020. Verification Staircase: a Design Strategy for
Actionable Explanations. In Proceedings of the IUI workshop on Explainable Smart
Systems and Algorithmic Transparency in Emerging Technologies (ExSS-ATEC’20).
Cagliari, Italy, 6 pages.

∗ Also with Sectra AB.

1   INTRODUCTION
Machine learning (ML) techniques have the potential to impact clinical decision making
in the field of digital pathology. However, a barrier is adapting experimental results
into everyday clinical use. One issue is that while models show impressive overall
results in experimental settings, there is usually a relevant subset of cases where
the model performs significantly worse than humans [6, 8]. Other issues such as
dataset shift [7] and bias [13] also motivate a model of interaction with the
predictive component positioned in the loop of human decision making.
   In this paper, based on our experiences from a dual industrial-academic
perspective, we outline a design strategy that we believe can be useful to resolve
some of the challenges with designing human-ML collaborative systems.
   The primary issue addressed by our proposed design strategy is enabling the user to
answer questions such as:
    • When do I trust the prediction enough to use automatic support, and when should
      I employ another diagnostic [...]
    • How can I justify my decision if a colleague asks?
    • How can I feel safe in my conclusion?
   In our suggested design strategy, multiple characteristics combine to enable
answers to such questions, including in-the-loop correction, decomposition to allow
explanations through causal inference, and designing to afford use with both
high-performing predictions as well as border-case [...]
   Our insights are from ongoing human-centered design explorations. The presented
perspective is rooted in our experience as UX practitioners within the field of
digital pathology, with a strong emphasis on practical relevance. Typically, the goal
of our design effort is to make systems where the resulting value is co-created
between artifacts and humans in the context of use [1]. Thus we approach
explainability pragmatically, starting from users’ goals and needs. Our account is
less concerned about taxonomy such [...] create better patient outcomes.
ExSS-ATEC’20, March 17–20, 2020, Cagliari, Italy                                                                                         Martin Lindvall and Jesper Molin

   The layout of this paper is as follows: first, we present and motivate the strategy
of verification staircase. Second, we illustrate the concept by an explorative design
case for assisted quantification in digital pathology. Finally, we discuss our concept
in the context of explainable intelligent user interfaces and outline our proposed
continuation of the research.

2   FROM CLIFFS TO STAIRCASES
Consider a predictive model trained to assess whether a patient is eligible to receive
some cancer-inhibiting drug. In the context of digital pathology, where tissues are
viewed at high magnification, the result might be visualized in the context of the
area of interest as depicted in Figure 1.

[Figure 1 image. Overlay text: PD-L1 Positivity: 3%]
Figure 1: A diagnostic recommendation by a predictive model presented in the context
of a digital pathology image

   For such an interaction, the user is supposed to look at the visualization and, if
everything looks fine, accept the overall result. An appropriate strategy might be to
trust and accept the result if the underlying accuracy is good enough for this
particular case and reject it otherwise. If the user rejects the result, they will
need to resort to performing the task manually. If the user interface affords no other
means of judging the underlying accuracy than the manual approach, chances are that
unless there exist very strong guarantees that the model performs well on all possible
cases, they will always reject the result and be forced to perform their manual [...]
   We call this kind of human-ML interaction a verification cliff, as depicted in
Figure 2.

[Figure 2 diagram. Title: Verification cliff. Axis: Accuracy. Labels: Accept or
reject prediction; Fall back to manual review; "Good enough"-threshold]
Figure 2: In some human-ML interfaces the user must decide to either accept or reject
the prediction based on their belief about the underlying accuracy

   What if there instead were multiple levels at which human-ML collaboration could be
performed? Having modes of human operation corresponding to nuanced levels of control
has long been recognized as an important factor for interaction with automation
[10, 11].
   We argue that the performance characteristics of many ML applications make them
suitable for splitting collaboration into several levels, in a similar manner to the
hierarchies of ecological interface design [14]. We will next illustrate this for our
pathology scenario.
   Many diagnostic tasks within pathology can be divided into multiple subtasks, e.g.
an overall case-level score is derived from a formula combining the detection and
classification of many individual cells. Consequently, it is possible to measure the
accuracy per diagnostic case. When predictive algorithms are evaluated, it is common
to present an overall accuracy across cases in the form of an AUC, F1-score or Cohen’s
kappa. However, in a scenario with case-level subtasks, we can also characterize the
distribution of per-case accuracies over a large number of cases, see Figure 3.

[Figure 3 plot. Title: Accuracy distribution per case. Horizontal axis: F1 Score
(0.00 to 1.00). Vertical axis: Number of cases]
Figure 3: Common distribution of per case accuracies of a predictive model. There is a
peak close to the average accuracy for the test set and then a long tail of cases.

   The shape that is seen in the figure is typical and has been observed for many
applications in our research. There is usually a peak in the distribution
corresponding to the average accuracy and then a long tail of cases, with some cases
almost always failing completely.
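To make the per-case view concrete, the following minimal Python sketch computes an F1 score for each case from cell-level counts and bins the scores into a histogram of the kind shown in Figure 3. The function names and the example counts are hypothetical and are not taken from the study; the sketch also assumes an F1 of 0.0 for a case without any positive cells, where the score is otherwise undefined.

```python
def f1_score(tp, fp, fn):
    """F1 score for one case from cell-level counts.

    Returns 0.0 when there are no positive cells at all (an assumption;
    F1 is undefined in that situation).
    """
    denom = 2 * tp + fp + fn
    return 2 * tp / denom if denom else 0.0


def per_case_histogram(cases, n_bins=10):
    """Count cases per F1 bin of width 1/n_bins (a score of 1.0 goes in the last bin)."""
    counts = [0] * n_bins
    for tp, fp, fn in cases:
        score = f1_score(tp, fp, fn)
        counts[min(int(score * n_bins), n_bins - 1)] += 1
    return counts


# Hypothetical (true positive, false positive, false negative) counts per case.
cases = [(90, 5, 5), (80, 10, 10), (85, 10, 5), (40, 30, 30), (2, 20, 50)]

histogram = per_case_histogram(cases)

# Pooled score: sum the counts over all cases first, then compute a single F1.
pooled = f1_score(*(sum(column) for column in zip(*cases)))

print(histogram)         # one count per 0.1-wide F1 bin
print(round(pooled, 2))  # single overall score across all cells
```

With these made-up counts, three cases land in the two highest bins while two fall into the tail, something the single pooled score does not reveal.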

   A better design strategy would be to think about how we can help our users when
predictions fall within different intervals on the distribution. In many cases it is
possible to divide the design into multiple interactions, such as:
    (1) A good result visualization that can be used to quickly verify predictions on
        the 0.9-1.0 span
    (2) A correction tool for small modifications of predictions that updates the
        overall result on the 0.7-0.9 span
    (3) A semi-automatic aid not even based on the original predictions on the 0.4-0.7
        span etc.
   This way we could attempt to create multiple user interfaces aimed at helping the
user when predictions happen to fall in different positions on the accuracy
distribution.
   The decision of whether or not to trust the prediction would now be a question of
degree: the trust placed could guide the choice of an interaction with an appropriate
level of automatic support. The question then becomes: How would the user learn in
which level to place their trust?
   We suggest that requiring levels to be connected, correctable and composable,
together with visualizations that make errors apparent, could be enough. In such a
design, users should be able to dynamically move between interaction levels and
perform corrections. Actions at one level should immediately be reflected in the
others. We argue that this combination of actionable and composable levels will enable
users to calibrate their trust over time, through learning to correlate top-level
observations with the suitable amount of drill-down behavior. We call this strategy a
verification staircase, as depicted in Figure 4.

[Figure 4 diagram. Title: Verification staircase. Axes: Level of automatic support
(Low); Number of cases (Few to Many). Labels: Accept or reject prediction; Too many
errors for simple corrections; Predictions are mostly useless; [...] corrections UI;
Batch correction UI; Manual review]
Figure 4: In a verification staircase, multiple assistive interactions are combined in
a way such that the user can move between them. While some levels mean more work and
less support, they give more control and a better understanding of the predictions on
lower-level phenomena. Corrections at lower levels affect higher levels, and vice
versa. Each level is designed to allow the human-AI ensemble to be productive within
an interval of (imperfect) prediction accuracy.

   In the following part of this paper we will describe an ongoing case study where we
have instantiated this design strategy for a tool that aids quantification in digital
pathology.

Method
We followed an iterative user-centered design (UCD) methodology combining sketching,
high fidelity (hi-fi) prototyping, data collection, model debugging, user observations
and interviews. Pathologists and clinical experts were consulted throughout the
process. Compared to traditional UCD, we used hi-fi prototyping earlier and more
frequently. This is motivated by the difficulty of eliciting, through sketches and
other low fidelity methods, how the predictive output will behave and be experienced.
Our account of the design process selectively highlights those insights we believe are
important for appropriation and adaptation of the concept of verification staircase to
other domains.

Diagnostic task
The assisted quantification task targeted in our case study is to determine the ratio
of two types of cells. Some cancers hide from the immune system by a kind of cloaking
mechanism and can effectively be treated by disabling the cancerous cells’ ability to
do this. However, not all cancers hide by this mechanism. In order to determine
whether a patient shall receive this expensive treatment, cells are stained such that
the cell membrane of cells having the cloaking ability becomes brown. According to the
diagnostic protocol, for treatment to be effective more than 50% of the cancerous
cells in the tissue should have a stained membrane. If the tissue has more than 1%
stained cells, the treatment might be effective. If stained cells are below 1%, the
treatment will likely not work, and the patient should not be offered the treatment.
   Thus, the diagnostic decision is based on estimating or counting this ratio in a
possibly large tissue area. This task can be time-demanding and error-prone.
Pathologists can use two basic strategies. In the first, they look at the overall
impression of the image and use their experience and tacit knowledge to “intuitively”
determine the percentage right away. This is a very fast decision but can be
error-prone. The second strategy involves manually counting tumor cells both with and
without stained membrane, and then deriving the ratio
of the two. All things being equal, this second method will result in a more accurate
decision but is orders of magnitude more time-demanding. As a middle ground,
pathologists sometimes choose a much smaller area as a “representative sample”, and
only count within that area.
   A machine learning-based predictive model has the potential to always use the
second strategy, classifying at the cell level and reporting the exact ratio deriving
from the two [...]

[Figure 5 image annotations: Systematic spatial grid; Click to zoom in; Drag-and-drop
to reclassify]
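The cut-offs described for the diagnostic task can be written down as a small sketch. It assumes that the ratio is the number of stained tumor cells divided by all counted tumor cells, and that a ratio of exactly 1% or exactly 50% falls into the lower category, which the text does not specify. The function and category names are illustrative only and do not come from the prototype.

```python
def stained_ratio(stained, unstained):
    """Fraction of counted tumor cells that have a stained membrane."""
    total = stained + unstained
    if total == 0:
        raise ValueError("no tumor cells counted")
    return stained / total


def treatment_category(ratio):
    """Map a stained-cell ratio to the three ranges named in the protocol text.

    Assumption: values exactly at a cut-off (1% or 50%) are placed in the
    lower category; the source does not say how ties are handled.
    """
    if ratio > 0.50:
        return "expected to be effective"
    if ratio > 0.01:
        return "might be effective"
    return "likely not effective"


# Hypothetical counts from a manually counted area.
ratio = stained_ratio(stained=3, unstained=97)
print(f"{ratio:.1%}", treatment_category(ratio))
```

Because the categories change abruptly at the cut-offs, a small counting error matters far more for a case near 1% or 50% than for one in the middle of a range.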

Design process
We interviewed pathologists and observed their working processes when performing the
task manually. We also reviewed the diagnostic protocols, where available. We
collected and manually annotated cases and then trained a deep convolutional neural
network to perform the predictions.
   In one possible interaction, the user can delineate an area and receive the final
result of the model as a percentage, as was depicted in Figure 1. This type of
interaction is a verification cliff: the user has two options, either to accept the
result blindly or to reject it and perform their usual manual procedure. Based on the
notion of a verification staircase, we sought another interaction where, if the user
does not accept the top-most level of automation, they could step down to a lower
level of automatic support, one that is still easier and faster than manual work.
   We designed our first intermediate level for the case when most cells have received
the correct classification, but a few need to be corrected for a satisfactory overall
result. In the devised interface, the user can explore the top-level prediction by
viewing and verifying a systematic subset of decisions of the underlying model, as
depicted in Figure 5.

[Figure 5 image annotations: There is one group per label; Reclassify by pressing the
buttons]
Figure 5: The UI for the first intermediate level focused on individual
classifications. The predictions are patches sampled in a grid (top right) and can be
interacted with either in a gallery of patches (left) or in the context of the tissue
(right). The user is able to correct false classifications by clicking and dragging in
the gallery, or by clicking a patch in the magnified main view (bottom, right).
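A ratio estimated from a systematic subset of patches carries sampling uncertainty, and it changes whenever a patch is reclassified. The sketch below illustrates both points under loud assumptions: patches are treated as independent samples, classification errors are ignored, and a Wilson score interval stands in for the interval calculation, which the paper does not specify. All names (`grid_points`, `wilson_interval`, `estimate`) and numbers are hypothetical.

```python
import math


def grid_points(width, height, spacing):
    """Patch centers on a systematic grid with the given spacing (pixels)."""
    xs = range(spacing // 2, width, spacing)
    ys = range(spacing // 2, height, spacing)
    return [(x, y) for y in ys for x in xs]


def wilson_interval(positives, n, z=1.96):
    """Wilson score interval for a proportion (roughly 95% for z=1.96)."""
    if n == 0:
        raise ValueError("no patches sampled")
    p = positives / n
    denom = 1 + z * z / n
    center = (p + z * z / (2 * n)) / denom
    half = z * math.sqrt(p * (1 - p) / n + z * z / (4 * n * n)) / denom
    return center - half, center + half


def estimate(labels):
    """labels maps a patch position to True if it is classified as stained."""
    n = len(labels)
    positives = sum(labels.values())
    low, high = wilson_interval(positives, n)
    return positives / n, low, high


points = grid_points(1000, 1000, 100)                  # 10 x 10 = 100 patches
labels = {pt: i < 30 for i, pt in enumerate(points)}   # made-up model output
ratio, low, high = estimate(labels)

labels[points[0]] = False                              # user corrects one patch
ratio_after, low_after, high_after = estimate(labels)

print(f"{ratio:.1%} [CI {low:.1%} - {high:.1%}]")
print(f"{ratio_after:.1%} [CI {low_after:.1%} - {high_after:.1%}]")
```

Under these assumptions the interval narrows as more patches are sampled, so a denser grid buys certainty at the cost of more patches to verify.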
   At this level, the user is presented with a gallery of patches, sampled in a
systematic spatial grid, where the patches are visually grouped according to whether
they are considered to represent stained cells or not. For verification, the user can
click a patch to review it in full magnification. The user can reclassify a patch by
buttons in the magnified view or by drag-and-drop in the gallery. As soon as the user
changes a patch, the final (top-level) ratio is updated (e.g. 31.4%
[CI 30.0 - 32.8]).
   We considered showing the decision of the model for each and every pixel point (a
“heatmap”), but this does not fulfill our criteria for the composability of levels.
Verifying and correcting every pixel-level decision would be infeasible for most
humans. In order to not create a barrier to the higher level of the summative cell
ratio, we thus limit the output of the model to grid-sampled patches. The percentage
is always reported with a calculated confidence interval, reflecting the uncertainty
derived from only making decisions on a subset of the tissue’s cells.
   To support cases where the ratio is very close to a decision cut-off, the user can
increase the certainty of their decisions by adding patches, making the sampling grid
denser.
   In evaluation with pathologists, we found that while this design was useful for a
large subset of clinical cases where the diagnosis was far from a decision cut-off,
there existed cases where the needed precision created a grid so dense that the amount
of verification overwhelmed the user, and again they had to resort to the manual
approach. Usually, the user only realized that the needed certainty for the case could
not be reached after extensive verification of many cell-level decisions.
   We sought to remedy this by finding another intermediary level that had more
automatic support than the one above, but less than only getting a final percentage.
To find opportunities for automatic support we analyzed the bias-based error in our
underlying model. We found that most errors are somewhat systematic; visually similar
patches might all be
assigned the “wrong” classification. For instance, the threshold for brown staining
intensity to be considered positive may differ between cases.
   Based on this, we added an algorithm for unsupervised visual similarity clustering
to our system and sought to design the interaction such that the user could work by
only making decisions on a cluster level. The user interface for this mode of
interaction is depicted in Figure 6.

Figure 6: Patches with the same predicted label are grouped by clusters in the
left-most panel. In this example, the cluster with four patches should be reclassified
as non-tumor. An experienced pathologist is able to do this just by looking at the
group of patches.

   In this prototype, the user can choose to look at the resulting percentage (e.g.
72.6% [69-76] N=3601), or to view the first few patches of each cluster, or to expand
clusters and inspect their constituent patches. Additionally, clusters are ordered by
uncertainty, and patches within the cluster are also ordered by uncertainty. The
intent is that the user can hopefully detect errors in only the first few clusters and
then accept the rest.
   A typical multi-level workflow when using this would be as follows:
   (1) Open the case and initiate the use of the tool.
   (2) (top level) Review the overall assigned percentage. Is it reasonable given the
       overall look of the tissue? If the confidence interval is far from a treatment
       cut-off, accept the result. Otherwise continue.
   (3) (individual corrections) Is the grid dense? If not, start reviewing and
       correcting the patches of the top-most clusters. Observe the updated percentage
       and the confidence. Stop when you’re making fewer corrections per cluster.
   (4) (batch correction) If the grid is dense, and there are over 500 patches, start
       reviewing the top-most clusters; based on its first patch, does each cluster
       have the correct classification? If not, correct the classification for the
       entire cluster. Observe the updated percentage and the confidence.
   (5) (individual correction) Check the patches in the cluster; does any patch “stand
       out” as not belonging to the cluster? Correct the patches by dragging them to
       the correct category; they will automatically be assigned to another cluster of
       that type.
   (6) (batch correction) Proceed through a few clusters; once no or few errors are
       detected, the rest is probably correct.

   We presented this multi-level version of the tool, in a small qualitative
assessment, to three pathologists who had not been part of the design process. The
three pathologists were presented the tool for the first time. We wanted to know
whether the prototype could be clinically useful and, more specifically, whether it
seemed the pathologists could learn multi-level strategies that allowed them to
balance detailed control, time spent and diagnostic quality. Our goal was primarily to
assess the concept’s viability for further empirical efforts.
   We found a recurring theme of initially wanting to drill down to cell level.
Pathologists reported that they would need some “alone time” to learn what kind of
systematic errors the prediction was making, and correlate this to the overall
appearance of the case. When asked whether they thought they would be able to learn
when to work at which level of detail, they were tentatively positive, but stated that
time would tell for certain.
   To us, it seemed the design had potential in allowing them to work with sometimes
inaccurate models. Moreover, by moving between levels through drill-down, we hope that
they might learn to calibrate their trust towards working at the right level as
appropriate. It could be that a more global, model-level understanding can be achieved
over time by interacting with local justifications like ours. By contrast, user
interfaces where human-ML collaboration becomes a verification cliff do not as readily
afford this, as the manual approach and the assisted approach are completely disjoint.
   While the results from such a small user study are mostly anecdotal at this point,
we are planning to evaluate this aspect more extensively in future research.

4    DISCUSSION
While our concept of verification staircases is early work, we believe it has
connections to many of the same issues that research on explainable and transparent
intelligent tools seeks to address.
   For instance, many of the principles outlined for Explanatory Debugging [3] are
imbued in our concept, such as being iterative, being sound & complete, not
overwhelming and
being actionable. The major difference is that our proposed explanations do not
correlate predictions to the inner workings of the model, but instead to the
underlying phenomena viewed at different fidelities.
   The need for enabling user feedback for explanations [12] is facilitated by
excluding references to inner workings of the model, letting the images of the domain
problem always act as the shared language to create common ground for communication.
It is noteworthy that this interaction affords continuous learning of the machine
learning component by enabling the corrections to become training data for future
iterations [4].
   Enabling global model understanding through repeated exposure to local
justifications is similar to the strategy employed by the LIME technique [9].
   Our current design aids the user in detecting errors, e.g., by sorting patches and
clusters on confidence. We then rely on the user being able to learn which end of the
model’s accuracy distribution they are in, or at least the suitable amount of
validation effort to spend. There exist other approaches to facilitating error
detection and determining the accuracy of classifiers [2] that could be interesting to
incorporate in future versions.

[...] USA, 126–137. https://doi.org/10.1145/2678025.2701399
 [4] Martin Lindvall, Jesper Molin, and Jonas Löwgren. 2018. From Machine Learning to
     Machine Teaching: The Importance of UX. Interactions 25, 6 (Oct. 2018), 52–57.
     https://doi.org/10.1145/3282860
 [5] Shane T. Mueller, Robert R. Hoffman, William Clancey, Abigail Emrey, and Gary
     Klein. 2019. Explanation in Human-AI Systems: A Literature Meta-Review, Synopsis
     of Key Ideas and Publications, and Bibliography for Explainable AI. (Feb. 2019).
     https://arxiv.org/abs/1902.01876v1
 [6] Luke Oakden-Rayner, Jared Dunnmon, Gustavo Carneiro, and Christopher Ré. 2019.
     Hidden Stratification Causes Clinically Meaningful Failures in Machine Learning
     for Medical Imaging. arXiv:1909.12475 [cs, stat] (Nov. 2019).
     http://arxiv.org/abs/1909.12475
 [7] Joaquin Quiñonero-Candela (Ed.). 2009. Dataset shift in machine learning. MIT
     Press, Cambridge, Mass.
 [8] Maithra Raghu, Katy Blumer, Greg Corrado, Jon Kleinberg, Ziad Obermeyer, and
     Sendhil Mullainathan. 2019. The Algorithmic Automation Problem: Prediction,
     Triage, and Human Effort. arXiv:1903.12220 [cs] (March 2019).
     http://arxiv.org/abs/1903.12220
 [9] Marco Tulio Ribeiro, Sameer Singh, and Carlos Guestrin. 2016. "Why Should I Trust
     You?": Explaining the Predictions of Any Classifier. In Proceedings of the 22nd
     ACM SIGKDD International Conference on Knowledge Discovery and Data Mining
     (KDD ’16). ACM, New York, NY, USA, 1135–1144.
     https://doi.org/10.1145/2939672.2939778
[10] Thomas B. Sheridan. 2018. Comments on “Issues in Human–Automation Interaction
     Modeling: Presumptive Aspects of Frameworks of Types and Levels of Automation” by
     David B. Kaber.
   A limitation of our current prototype is that a user’s cor-                      Journal of Cognitive Engineering and Decision Making 12, 1 (March
rection of single patches or clusters affect only the directly                      2018), 25–28. https://doi.org/10.1177/1555343417724964
involved patches, clusters and the overall ratio. We have                      [11] Thomas B. Sheridan and William L. Verplank. 1978. Human and
experimented with versions where the model is fine-tuned                            Computer Control of Undersea Teleoperators. https://doi.org/10.
using this input and the predictive output is updated, in an                        21236/ada057655
                                                                               [12] Alison Smith and James J Nolan. 2018. The Problem of Explanations
interactive machine learning manner. However, this kind of                          without User Feedback. (2018). Position paper presented at the IUI’18
global updates creates a lack of control for which we are                           Workshop on Explainable Smart Systems.
yet to find good interaction design solutions that suit our                    [13] Antonio Torralba and Alexei A. Efros. 2011. Unbiased look at dataset
safety-critical domain. We believe this is an interesting area                      bias. In CVPR 2011. 1521–1528. https://doi.org/10.1109/CVPR.2011.
of future research.                                                                 5995347 ISSN: 1063-6919.
                                                                               [14] K. J. Vicente and J. Rasmussen. 1992. Ecological interface design: theo-
                                                                                    retical foundations. IEEE Transactions on Systems, Man, and Cybernetics
ACKNOWLEDGMENTS                                                                     22, 4 (July 1992), 589–606. https://doi.org/10.1109/21.156574
This work was partially supported by the Wallenberg AI,
Autonomous Systems and Software Program (WASP)

