=Paper=
{{Paper
|id=Vol-2068/exss9
|storemode=property
|title=What Should Be in an XAI Explanation? What IFT Reveals
|pdfUrl=https://ceur-ws.org/Vol-2068/exss9.pdf
|volume=Vol-2068
|authors=Jonathan Dodge,Sean Penney,Andrew Anderson,Margaret Burnett
|dblpUrl=https://dblp.org/rec/conf/iui/DodgePAB18
}}
==What Should Be in an XAI Explanation? What IFT Reveals==
Jonathan Dodge, Sean Penney, Andrew Anderson, Margaret Burnett
Oregon State University, Corvallis, OR, USA
{dodgej, penneys, anderan2, burnett}@eecs.oregonstate.edu

© 2018. Copyright for the individual papers remains with the authors. Copying permitted for private and academic purposes. ExSS 2018, March 11, 2018, Tokyo, Japan.

ABSTRACT
This workshop's call for participation poses the question: What should be in an explanation? One route toward answering this question is to turn to theories of how humans try to obtain information they seek. Information Foraging Theory (IFT) is one such theory. In this paper, we present lessons we have learned about how IFT informs Explainable Artificial Intelligence (XAI), and also what XAI contributes back to IFT.

CCS Concepts
• Human-centered computing → User studies; • Computing methodologies → Intelligent agents;

Author Keywords
Intelligent Agents; Explainable AI; Intelligibility; Content Analysis; Video Games; StarCraft; Information Foraging

INTRODUCTION
Explainable AI (XAI) is burgeoning to help ordinary users understand their intelligent agents' behavior – but many fundamental questions remain in order to achieve this goal. This paper describes our recent progress toward one such question: What should be in an explanation?

We have been working to answer this question in a domain often used for AI research, namely Real-Time Strategy (RTS) games, from two sides. First, to understand what a high-quality supply of explanations might contain, we conducted a qualitative analysis of the utterances of expert explainers [2]. Second, to understand demand for explanations in the same domain, we conducted a user study [10] to understand the questions participants formulated when assessing an intelligent agent playing the popular RTS game StarCraft II [9]. Here, we focus on the latter study.

There have been previous explorations into what should be in an XAI explanation [1, 6, 7, 8, 14, 15], but few such explorations draw upon theories of how humans problem-solve. We used Information Foraging Theory (IFT) [12] to help fill this gap and approach our investigation. IFT is based on a predator-prey model [12]. Grounded in prior work about how people seek information [3, 11], we used StarCraft II to investigate how both the expert explainers (suppliers) and our participants ("demanders") would navigate the information environment as they sought to make sense of a game while it unfolded.

In the RTS domain, players compete for control of territory by fighting for it. Each player raises an army to fight their opponents, which takes resources and leads players to build Expansions (new bases) to gain more resources. Players also can use resources to create Scouting units, which lets them learn about their enemies' movements to enable Fighting in a strategic way. For a more in-depth explanation of the domain, refer to [9].

In our user study [10] investigating this domain, we gave 20 experienced StarCraft II players a game replay file¹ to analyze and asked them to record whatever they thought were key decision points (i.e., any "event which is critically important to the outcome of the game") during the match. Participants worked in pairs, allowing us to keep them talking by leveraging the social convention of conversing about their collaborative task. Because we wanted to understand how the participants go about assessing an intelligent agent's decisions, we told them that one of the players in the game was under AI control. However, this was not true; both players were human professionals.

¹ We used game 3 of this match (http://lotv.spawningtool.com/23979/) from the IEM Season XI - Gyeonggi tournament.
Figure 1. A screenshot from our study, with participants anonymized (bottom right corner). Superimposed red boxes point out: (1, bottom left) the Minimap, a bird's-eye view enabling participants to navigate around the game map; (2, top left) a drop-down menu to display the Production tab for a summary of build actions in progress; (3, middle right) Time Controls to rewind/forward or change speed.

The participants' main task was to assess the AI's capabilities. To do so, the participants replayed the game using the built-in StarCraft tool, shown in Figure 1, which offers the ability to observe the previously recorded events. The tool provided functionality to freely navigate with the camera, pause/rewind with time controls, and drill down into various aspects of the game state, helping participants decide how the AI was doing.

After participants finished the main task, we conducted a retrospective interview in two parts. In both parts, we asked participants questions about things they had said and done, while pointing them out in the video we had just made of those participants working on the task. In the first part, we navigated to each decision point they identified and asked why it was so important. In the second part, we asked about selected navigations using questions based on previous work [11], such as "What about that point in time made you stop there?". A more detailed methodology can be found in [10].

WHAT WE'VE LEARNED SO FAR: IFT → XAI

Things we've learned from studying Prey
In IFT, predators seek prey, which are the pieces of information in the environment they think they need. In the context of XAI, such prey are evidence of the agent's decision process, which is then used to create explanations for the agent's actions. To investigate the information participants were trying to obtain, we analyzed the questions that they asked each other. We categorized their questions according to the Lim-Dey intelligibility types [7], which separate questions into What, What-Could-Happen, Why-Did, Why-Didn't, and How-To. We also added a Judgment intelligibility type to capture when participants sought a quality judgment.
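As an aside for readers who want to see the shape of this coding scheme, a minimal sketch follows; all names in it are hypothetical, and the study's actual categorization was done by human researchers, not by code.

<pre>
from collections import Counter
from dataclasses import dataclass
from enum import Enum

class IntelligibilityType(Enum):
    # Lim-Dey intelligibility types [7], plus the Judgment type added in [10].
    WHAT = "What"
    WHAT_COULD_HAPPEN = "What-Could-Happen"
    WHY_DID = "Why-Did"
    WHY_DIDNT = "Why-Didn't"
    HOW_TO = "How-To"
    JUDGMENT = "Judgment"  # captures requests for a quality judgment

@dataclass
class CodedUtterance:
    # One participant question, labeled by a human coder.
    text: str
    itype: IntelligibilityType

# The first example is quoted from the paper; the second is hypothetical.
coded = [
    CodedUtterance("Is the human building any new stuff now?", IntelligibilityType.WHAT),
    CodedUtterance("Why didn't he attack the expansion?", IntelligibilityType.WHY_DIDNT),
]

# Demand per explanation type is then simply the label distribution.
demand = Counter(u.itype.value for u in coded)
print(dict(demand))
</pre>

Counting labels this way is all that is needed to compare demand (participant questions) with supply (shoutcaster utterances) per intelligibility type, which is the comparison reported below.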
Although most previous XAI research has found Why to be highly demanded information, our participants rarely sought Why or Why-Didn't information. Instead, our participants showed a strong preference for asking What questions.

What was so interesting about What? The participants' What information seeking was about finding out more about state than they currently knew. Our participants did so primarily in three categories: drill down, higher level, and temporal. Drill-down Whats usually involved participants spatially navigating around the map, sometimes opening up objects or menus to access more detailed game state information, e.g., "Is the human building any new stuff now?" The second category, higher-level Whats, involved trying to abstract a little above the details, to gain a higher level of understanding of the game state, e.g., "What's going on over there?" The third category, temporal Whats, involved finding out more about differences or similarities in state over time, e.g., "When did he start building...?"

Finally, to investigate whether our distribution of What vs. Why results was reasonably representative for this domain, we compared our participants' questioning (i.e., explanation demand) against the answers (explanation supply) produced by professional explainers in this domain, namely shoutcasters² [2]. The results showed that the shoutcasters' commentaries in StarCraft games [2] matched well with the above explanation demands. In particular, shoutcasters' utterances were mostly about the What intelligibility type, with very few utterances of the Why or Why-Didn't types. Further, the shoutcasters were remarkably consistent with each other in frequency of using each intelligibility type. The consistency between the supply-side and demand-side results offers evidence that in the RTS domain, What explanation content is more in demand than Why or Why-Didn't.

² Shoutcasters are sportscasters for e-sports. They perform a similar sort of analysis as our participants were doing, but with the added constraint that they must analyze and explain the game in real-time, so they cannot pause or rewind.

Implications: Taken together, these results show that in this domain, participants placed very high value on state information — but not always at the same granularity, and not always restricted to a single moment in time. How an XAI system can satisfy these explanation needs may not be straightforward, but one of our findings suggests a way forward: Shoutcasters may be usable as a gold standard. That is, the remarkable similarity between the frequency of shoutcasters' utterances (supply) and participants' desired prey (demand) for most intelligibility types suggests that XAI explanation systems in the RTS domain may be able to model their explanation content, timing, and construction around shoutcasters' explanations.

Things we've learned from studying Paths
In IFT, prey exists within some patch(es), and the forager navigates between patches by following paths, made up of one or more links. Investigating the paths participants used revealed a great deal of information about the kinds of costs they can incur in the RTS domain when seeking information.

Traditionally, IFT looks at the navigation cost to get to a patch (here, an explanation), usually in number of clicks, and the cognitive cost of absorbing the necessary information in the patch once there. These costs are relevant to XAI too, but our investigation discovered participants incurred significant cognitive effort in both path discovery and path triage.

Why so expensive? Professional RTS players perform several hundred actions per minute (APM), and each such action potentially destroys or updates the available foraging paths. This produces an information environment in which foraging paths are numerous, rapidly updating, and have limited lifespans. Thus, our participants were faced with many more potentially useful foraging paths than they could possibly follow, and had to spend significant effort just choosing a path.

Some coped with these costs by adhering to a single foraging path throughout the task, rewinding rarely. These participants minimized their cognitive costs of choosing, but paid a high information cost, because by not following other paths, they missed out on potentially explanatory information. Others chose not to pay this information cost, and instead paid a navigation cost by often rewinding and pausing to spatially explore. Rewinding also incurs substantial cognitive cost, as more context information must be tracked — but that extra context may provide useful explanatory power.
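The paper uses IFT's cost/value vocabulary without restating the underlying model. For readers new to IFT, its conventional formulation [12] treats the forager as roughly maximizing the rate of information gained per unit time, along the lines of:

<pre>
R = \frac{G}{T_B + T_W}
</pre>

where R is the rate of gain, G the information gained, T_B the time spent navigating between patches, and T_W the time spent within patches absorbing what is found. Read this way, the path discovery and triage effort observed in this study inflates the between-patch term, which is one way to see why participants' coping strategies traded information gained against time spent choosing and navigating.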
Figure 2. Dots show the Building-Expansion decision points each participant pair (y-axis) identified over time (x-axis). Red lines show when Expansion events actually occurred. (Participants noticed most of them.) The red box shows where Pair 4 failed to notice an event they likely wanted to note, based on their previous and subsequent behavior.

Interestingly, when costs of choice were low, participants' explanation seeking followed fairly traditional foraging patterns. For example, early in the game, participants scrutinized the game objects carefully and in detail — a sharp contrast to late in the game when many more game objects and foraging paths were present. This could suggest that as the information environment grows in complexity, users in this domain will seek explanations at a higher level of abstraction (i.e., a group of units as opposed to a single unit).

Implications: An XAI explanation system would benefit from incorporating an explanation recommender. Such a recommender could take into account both the human cognitive cost of considering too many paths when few can be followed, and the information cost of neglecting some path too long. For example, if the domain is well known to an explanation system a priori, such a recommender may help guide users (reducing their cognitive cost of choosing) to the explanations that are the most important (reducing the information cost of missing important explanatory information). In this case, it appears we know Expansions are important before any analysis occurs.
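The paper does not describe how such a recommender would be built. As a minimal sketch of the idea under stated assumptions (all names, fields, and weights below are hypothetical), the two costs named above can both appear in a simple path-ranking score, with only the top few candidates surfaced so the user's cost of choosing stays low:

<pre>
from dataclasses import dataclass

@dataclass
class ExplanationPath:
    # A candidate foraging path to an explanation (all fields hypothetical).
    label: str                # e.g. "Expansion at 7:30"
    expected_value: float     # domain-informed importance, e.g. Expansions rank high a priori
    navigation_cost: float    # clicks/rewinds needed to reach the patch
    seconds_neglected: float  # how long this path has gone unexamined

def score(path: ExplanationPath, neglect_weight: float = 0.05) -> float:
    # Higher expected value and longer neglect raise the score (the information
    # cost of continuing to ignore the path); navigation cost lowers it.
    return path.expected_value + neglect_weight * path.seconds_neglected - path.navigation_cost

def recommend(paths: list[ExplanationPath], k: int = 3) -> list[ExplanationPath]:
    # Showing only the top-k candidates keeps the cognitive cost of choosing low.
    return sorted(paths, key=score, reverse=True)[:k]

if __name__ == "__main__":
    candidates = [
        ExplanationPath("Expansion at 7:30", expected_value=5.0, navigation_cost=1.0, seconds_neglected=90),
        ExplanationPath("Scouting run at 8:10", expected_value=3.0, navigation_cost=2.0, seconds_neglected=240),
        ExplanationPath("Skirmish at 8:45", expected_value=2.0, navigation_cost=0.5, seconds_neglected=10),
    ]
    for p in recommend(candidates, k=2):
        print(p.label, round(score(p), 2))
</pre>

A real system would still need domain-informed estimates of expected value (e.g., that Expansions matter a priori, as noted above) and of how long each path has been neglected; the sketch only shows where those two quantities would plug in.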
Things we've learned from studying Scent and Cues
Recall that we requested that participants write down key decision points. To forage for these, they followed cues, which are information features connected to links in the environment. In our study, cues were the same across sessions, because everyone replayed the same game. Unlike cues, scent is "in the head" – it is foragers' assessment of cues' meaning.

Unfortunately, participants missed information that we suspect they would have found key, because some cues distracted them. Our videos showed that what participants were looking at when they were distracted — the "distractor cues" — tended to be combat-oriented and affected even simple game states.

For example, in Figure 2, a nearly full decision column suggests that participants tended to agree that this decision was key. A nearly full participant pair row suggests that this pair consistently found this type of decision to be key. Thus, missing dots (e.g., see the red box) correspond to times when a participant pair was distracted from a key event.

In fact, the scents emanating from some types of cues seemed to overpower others consistently. Consider the example in Figure 3, which shows how Fighting tended to overpower Scouting. The top image in the figure shows all of the Scouting decision points our participants identified, while the bottom image shows all of the Fighting decision points. The red line going through both images is the point at which combat first begins – and also the time when scouting is usually last noticed, despite being ongoing throughout the game.

Figure 3. (Top:) The Scouting decision points identified by our participant pairs (y-axis), with game time on the x-axis. (Bottom:) The Fighting decision points identified, plotted on the same axes. After Fighting events begin (red line), Scouting decision points are no longer noticed often — despite important Scouting actions continuing to occur.

Implications: Distractions abound, and may be systematic. Some facets of the environment may elicit emotional responses and receive undue attention as a result. In this case, participants preferred investigating Fighting over Scouting.

Another implication relates to how an XAI explanation system's user interface supports human workflow. Paths are easily forgotten in the presence of interruptions. Each new action may interrupt the current foraging path, which leads to people forgetting things – made worse by sheer path quantity. Previous research has found that To-Do Listing [5] is an effective strategy to help prevent users from forgetting so much.

WHAT WE'VE LEARNED SO FAR: XAI → IFT
In the previous sections, we focused on things we learned about XAI by applying IFT to our data set. Now, we turn the other direction, since this study is the first to apply IFT to XAI and the RTS domain. The RTS domain presents an extremely complex and rapidly changing environment, more so than other IFT environments in the literature like Integrated Development Environments (IDEs) and web sites [3, 4, 11, 13]. In the RTS domain, hundreds of actions happen each minute. Further, the environment is continually affected by actions which do not originate from the forager.

As discussed, our participants were faced with many paths, and had to rapidly triage which paths to follow. This presents an interesting IFT challenge. Previous research [11] identified a "scaling up problem" in IFT — a difficulty estimating value/cost of distant prey as the path to the prey became long. In our case, we observed that foraging paths were short, but since so many paths are available, not much time is available to make an accurate path value/cost estimate. The current study reveals a "breadth version" of this scaling problem (Figure 4).

Figure 4. Conceptual drawing of foraging in the RTS domain vs. previously studied foraging. (Left, labeled "Foraging in other environments"): Information environments in prior IFT literature: the predator considers few paths, but the paths are sometimes very deep. This figure was inspired by an IDE foraging situation in [11, Fig. 5]. (Right, labeled "Foraging in RTS environments"): Foraging in RTS, where most navigation paths are shallow, but with numerous paths to choose from at the top level.

Turning to prey, in the XAI setting, the prey is evidence of the agent's decision process. Establishing trust in an XAI system requires the user to know how it behaves in many circumstances. Thus, the prey is "in pieces" – meaning that bits of it are scattered over many patches. As previous work [11] has shown, "prey in pieces" creates foraging challenges, because finding and assembling all the bits can be tedious and error-prone. In the model-agnostic XAI setting, IFT's "prey in pieces" problem becomes even more pronounced, because of the uncertain relationships between causes/effects, or even whether the agent will ever behave the same way again.
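To make the "prey in pieces" notion concrete, here is a minimal sketch, again with hypothetical names and example strings rather than study data: evidence of the agent's decision process is scattered across patches, and an explanation is only as complete as the set of patches the forager actually managed to visit, so every missed patch leaves a visible hole.

<pre>
from dataclasses import dataclass

@dataclass(frozen=True)
class EvidencePiece:
    # One fragment of evidence about the agent's decision process (hypothetical).
    patch: str        # where it lives, e.g. "minimap @ 7:30" or "Production tab @ 8:00"
    observation: str  # what the forager learns there

def assemble_explanation(required: set[str], found: list[EvidencePiece]) -> str:
    # An explanation is only complete when every required patch has been visited.
    visited = {piece.patch for piece in found}
    missing = required - visited
    if missing:
        return f"Incomplete explanation; still need evidence from: {sorted(missing)}"
    return " ; ".join(piece.observation for piece in found)

# Illustrative only: three scattered pieces, of which the forager found two.
required_patches = {"minimap @ 7:30", "Production tab @ 8:00", "army count @ 8:45"}
found_pieces = [
    EvidencePiece("minimap @ 7:30", "opponent expanded to a third base"),
    EvidencePiece("Production tab @ 8:00", "AI queued anti-air units"),
]
print(assemble_explanation(required_patches, found_pieces))
</pre>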
CONCLUSION
This paper summarizes the first investigation into information foraging behaviors shown by participants tasked with assessing an RTS intelligent agent. Our formative studies used IFT to inform XAI and vice versa, by examining both supply (expert explanations) and demand (users' questions).

Our use of the IFT lens allows us to leverage results obtained from applying IFT to non-XAI domains, while also improving the ability to transport/generalize findings among XAI domains. By connecting XAI to IFT foundations, we can bring to XAI a real theoretical foundation based on what information humans want and how they look for it.

ACKNOWLEDGMENTS
This work was supported by DARPA #N66001-17-2-4030 and NSF #1314384. Any opinions, findings and conclusions or recommendations expressed are those of the authors and do not necessarily reflect the views of NSF, DARPA, the Army Research Office, or the US government.

REFERENCES
1. S. Amershi, M. Cakmak, W. Knox, and T. Kulesza. 2014. Power to the people: The role of humans in interactive machine learning. AI Magazine 35, 4 (2014), 105–120.
2. J. Dodge, S. Penney, C. Hilderbrand, A. Anderson, L. Simpson, and M. Burnett. 2018. How the Experts Do It: Assessing and Explaining Agent Behaviors in Real-Time Strategy Games. In ACM Conference on Human Factors in Computing Systems. To Appear.
3. S. Fleming, C. Scaffidi, D. Piorkowski, M. Burnett, R. Bellamy, J. Lawrance, and I. Kwan. 2013. An information foraging theory perspective on tools for debugging, refactoring, and reuse tasks. ACM Transactions on Software Engineering and Methodology (TOSEM) 22, 2 (2013), 14.
4. W. Fu and P. Pirolli. 2007. SNIF-ACT: A cognitive model of user navigation on the world wide web. Human-Computer Interaction 22, 4 (2007), 355–412.
5. V. Grigoreanu, M. Burnett, and G. Robertson. 2010. A strategy-centric approach to the design of end-user debugging tools. In ACM Conference on Human Factors in Computing Systems. ACM, 713–722.
6. T. Kulesza, M. Burnett, W. Wong, and S. Stumpf. 2015. Principles of explanatory debugging to personalize interactive machine learning. In ACM International Conference on Intelligent User Interfaces. ACM, 126–137.
7. B. Lim and A. Dey. 2009. Assessing demand for intelligibility in context-aware applications. In ACM International Conference on Ubiquitous Computing. ACM, 195–204.
8. B. Lim, A. Dey, and D. Avrahami. 2009. Why and why not explanations improve the intelligibility of context-aware intelligent systems. In ACM Conference on Human Factors in Computing Systems. ACM, 2119–2128.
9. S. Ontañón, G. Synnaeve, A. Uriarte, F. Richoux, D. Churchill, and M. Preuss. 2013. A Survey of Real-Time Strategy Game AI Research and Competition in StarCraft. IEEE Transactions on Computational Intelligence and AI in Games 5, 4 (2013), 293–311.
10. S. Penney, J. Dodge, C. Hilderbrand, A. Anderson, L. Simpson, and M. Burnett. 2018. Toward Foraging for Understanding of StarCraft Agents: An Empirical Study. In ACM Conference on Intelligent User Interfaces. To Appear.
11. D. Piorkowski, A. Henley, T. Nabi, S. Fleming, C. Scaffidi, and M. Burnett. 2016. Foraging and navigations, fundamentally: developers' predictions of value and cost. In ACM International Symposium on Foundations of Software Engineering. ACM, 97–108.
12. P. Pirolli. 2007. Information Foraging Theory: Adaptive Interaction with Information. Oxford University Press.
13. S.S. Ragavan, S. Kuttal, C. Hill, A. Sarma, D. Piorkowski, and M. Burnett. 2016. Foraging among an overabundance of similar variants. In ACM Conference on Human Factors in Computing Systems. ACM, 3509–3521.
14. S. Stumpf, E. Sullivan, E. Fitzhenry, I. Oberst, W. Wong, and M. Burnett. 2008. Integrating rich user feedback into intelligent user interfaces. In ACM International Conference on Intelligent User Interfaces. ACM, 50–59.
15. J. Vermeulen, G. Vanderhulst, K. Luyten, and K. Coninx. 2010. PervasiveCrystal: Asking and answering why and why not questions about pervasive computing applications. In IEEE International Conference on Intelligent Environments (IE). IEEE, 271–276.