Intelligent After-Action Review Support Tools for Large Team Training

Sowmya Ramachandran & Randy Jensen
Stottler Henke Associates, San Mateo CA 94010, USA
sowmya@stottlerhenke.com

Abstract. The challenges of intelligent tutoring are greatly amplified when applied to team training. One avenue for near-term progress is to explore highly utilitarian solutions that automate certain instructional tasks, e.g. tools that support human instructors and extend their abilities without the features of a full intelligent tutoring system (ITS). Human instructors bring a wealth of expertise and experience that are difficult to replicate or replace. However, their attention can be overwhelmed by the volume of data generated during simulation-based training. Intelligent tools to digest the data into intelligible and actionable forms can greatly enhance the capabilities of instructors in a team setting. This paper describes two tools designed to support team training simulation exercises. They facilitate after-action review by analyzing the simulation interaction data for patterns and events that are interesting from the perspective of assessment and feedback. While instructors are still in charge of training, these tools provide important automated assessment support. Such tools can serve as stepping stones on the path to realizing intelligent tutoring capabilities for team training.

Keywords: simulation-based team training, after-action review, natural language processing

Copyright © 2019 for this paper by its authors. Use permitted under Creative Commons License Attribution 4.0 International (CC BY 4.0).

The challenges of intelligent tutoring are greatly amplified when applied to team training. The problem of inferring an individual’s knowledge state, intentions, and motivations now transforms into one of inferring the hidden states of many individuals. The uncertainties are magnified because attribution must be spread across multiple participants, groups, or sub-groups. The challenge of communicating with one grows into the challenge of communicating with many. Furthermore, there is an added requirement of being able to process and understand the communications amongst the trainees. Add to this the need for the tutor to assess not just task performance but also team performance, and the complexity of developing a team training intelligent tutoring system (ITS) explodes.

Given these challenges, it is worthwhile to explore less automated, but highly utilitarian solutions, e.g., tools that support human instructors and amplify their abilities. While addressing practical needs, the solutions so developed can contribute to the ultimate objective of developing team ITSs.

Team training is an area that is very well-suited to the “Stupid Tutoring, Intelligent Humans” idea advocated in [1]. When humans conduct large team training exercises, their experience and domain knowledge give them an edge in assessing performance and skill gaps, and in providing appropriate feedback. However, the amount of data generated in such simulation-based trainers can quickly overwhelm human instructors’ attention and processing powers. Intelligent tools to digest the data into intelligible and actionable forms can greatly enhance the capabilities of instructors in the team setting.

This paper explores two case studies involving tools for supporting team training by automating or facilitating instructor after-action review (AAR) tasks: (i) a tool tailored for U.S.
Marine Corps combined arms training, and (ii) IDA (Intelligent Diagnostic Assistant). Both were designed to assist AAR by analyzing the interaction data generated during simulation-based exercises for patterns and events that are interesting from the perspective of assessment and feedback. In the combined arms training application, rules are used to detect significant events during an exercise and correlate these conditions with the communications amongst the team members. IDA approaches communication analysis from a different angle and focuses on filtering, organizing, and visualizing communication streams to enhance human assessment and feedback.

Both tools were developed in response to stated user needs. As such, they represent steps toward addressing the kinds of practical problems that instructors face with team training and providing utilities to help augment their effectiveness. Intelligent solutions that distill large volumes of data into actionable intelligence that instructors can use to inform their assessment and feedback are key steps on the path to building ITSs for team training.

1 AAR Tool for Combined Arms Team Training

Combined arms operations involve teams of teams coordinating maneuver with multiple supporting arms, including direct fire, indirect fire, and fixed- and rotary-wing aviation. Teamwork is required both within elements (e.g., within an artillery battery) and across elements (e.g., between an artillery battery and forward observers, or between the artillery battery and commanders). Task-specific skills like adjusting artillery fire onto a target are a part of the training, but a key objective is to practice team skills, like the successful employment of indirect fire through coordination between observers, approval authorities, and liaisons with air and other elements of the battle. Training exercises may involve 100 or more participants at various stations, carrying out their respective operational responsibilities.

An intelligent AAR tool was developed for USMC combined arms training, which aims to help instructors not only to identify examples of good or bad performance from the exercise, but also to put together an effective debriefing that conveys the context of a training point and the key participants involved. The data sources available for automated analysis include the simulation event stream, voice communications, and human-in-the-loop inputs in the command and control (C2) tools used in the exercise environment. For example, the data artifacts associated with an artillery mission include radio communications during planning, C2 tool inputs specifying the timing and parameters for events to be triggered in the simulation, communications again for final approval before execution, and finally the simulation events when executed. In addition to collecting raw data from the different sources, the system performs automated speech recognition on radio communications in order to diagnose causal factors for possible errors. The combination of multiple data streams for automated analysis allows the system to identify training points related to individual task-related decisions and skills, team performance factors, and instances involving both.
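To make the correlation step concrete, the following is a minimal sketch of how a detected simulation event might be paired with the radio traffic that preceded it. The record types, field names, and the fixed lookback window are illustrative assumptions; the paper does not describe the fielded data schemas.

```python
from dataclasses import dataclass

# Hypothetical record types standing in for the exercise data streams.
@dataclass
class SimEvent:
    time: float          # seconds from exercise start
    kind: str            # e.g. "ordnance_detonation"
    unit: str

@dataclass
class Transmission:
    time: float
    network: str         # e.g. "CoA_TAC"
    speaker: str
    transcript: str      # automated speech recognition output

def comms_near_event(event, comms, window_s=180.0):
    """Gather transmissions in the window leading up to (and including)
    the event time: the raw material for rules that check what was, or
    was not, communicated around a significant simulation event."""
    return [t for t in comms
            if event.time - window_s <= t.time <= event.time]
```

Downstream rules would then inspect the returned transmissions for expected content (approvals, position reports) or flag their absence.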
1.1 Applying Team Behavioral Markers to Combined Arms Training

The process of constructing team performance measures for the combined arms domain involved an initial practical step of identifying the behavioral markers that are possible to collect and reason about from the available data streams, overlaid with consideration of which dimensions are most important for the domain in terms of their relationships to training objectives. The study of teamwork competencies has evolved over time, producing a variety of models with common features that collectively provide a theoretical basis for team diagnostic measures [2, 3, 4]. A recent meta-analysis [5] combined findings from key publications to suggest behavioral markers for intelligent team tutoring, organized into five factors: trust, collective efficacy, cohesion, communication, and conflict management. For example, markers for collective efficacy include a combination of actions like backup behaviors, and communication artifacts like affirmative comments about the team’s ability to complete tasks.

In addition to recent efforts to formalize the implementation of team tutors in concrete domains (e.g., [6]), a significant development is the availability of increasingly sophisticated natural language processing technology that can be used to mine communications data for instances of behavioral markers. For some applications, challenges remain where certain team interactions take place outside of the instrumented environment (e.g., an exercise environment may be constructed with digital communications infrastructure, but team members may still occasionally call out across the room or tap someone on the shoulder), or where subtleties embedded in interactions (e.g., inflection, voice level) are simply not among the collected data. For many markers, the best available methods for collection still involve human observers applying subjective interpretations. Yet a growing focus of research is to identify areas where automated systems can assess elements of team performance in a concrete manner from available exercise data streams and help offload instructors.

In combined arms, some of the most critical behavioral markers relating to effective team operations involve communications and especially information sharing (timely, complete, clear, concise, acknowledged). For example, during a coordinated assault, maneuver units should periodically report their position, roughly every 500 meters of movement. In the training environment, the tactical networks can be easily monitored for verbal position reports, which can be compared against the actual positions of the units in the simulation for accuracy and frequency. Similarly, scout teams should be reporting the location and movement of enemy units within their line of sight. A dedicated surveillance radio network can be monitored for reports from the scout teams and compared against ground truth from the simulation.
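As a rough illustration of how the 500-meter position-report convention could be checked automatically, the sketch below compares a unit’s simulated track against the times of transmissions that speech recognition has labeled as position reports. The data layout and the leg-based displacement heuristic are assumptions for illustration, not the fielded implementation.

```python
import math

def distance_m(p, q):
    """Planar distance in meters between two (x, y) grid positions."""
    return math.hypot(p[0] - q[0], p[1] - q[1])

def missed_position_reports(track, report_times, interval_m=500.0):
    """Flag movement legs of roughly interval_m with no position report.

    track        -- chronological list of (time, (x, y)) from the simulation
    report_times -- times of transmissions recognized as position reports
    Uses straight-line displacement per leg, a simplification of actual
    path length.
    """
    missed = []
    leg_start_t, last_pos = track[0]
    for t, pos in track[1:]:
        if distance_m(pos, last_pos) >= interval_m:
            # Did any recognized position report fall within this leg?
            if not any(leg_start_t <= r <= t for r in report_times):
                missed.append((leg_start_t, t))
            leg_start_t, last_pos = t, pos
    return missed
```

Each flagged interval is a candidate information-sharing deficiency for the instructor to review, rather than an automatic verdict.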
Another example is the Call For Fire (CFF), which initiates the events that ultimately cause a fire series to be executed. There are at least three key parties directly involved in a CFF: the requester, the deciding authority, and the battery that would fire the series. But any approval of requests for missions and movement should be disseminated beyond the directly involved parties; for example, the maneuver companies need to be aware of upcoming fire missions in their area because this supports their tactical activities and may also create new danger areas. Whether a CFF is approved, denied, or approved with modified parameters, this is important information to communicate.

In addition to what is communicated, a system can look for markers related to how content is communicated within or across teams. In combined arms operations there is a very specific vocabulary and syntax that should be used. The spirit of this, from an operational teamwork perspective, is to ensure that other team members can understand the content, especially in noisy environments with the potential for distractions and degraded signal. This also means that software tools can be constructed to look for verbal utterances that conform to the vocabulary and syntax. For example, in combined arms communications, mission approval and denial should be clearly indicated with the phrases “is approved” and “is denied,” respectively. In particular, the phrase “not approved” should not be used because of the potential confusion with “is approved,” and such negative examples can be easily detected with automated software.
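Checks of this kind reduce to pattern matching over speech-recognition transcripts. Below is a minimal sketch of the approval/denial phraseology check just described; the function name and the choice to operate on plain transcript strings are illustrative assumptions.

```python
import re

# Standard phraseology: approval/denial must use "is approved" / "is denied".
# "not approved" is flagged because, over a degraded radio signal, it is
# easily confused with "is approved".
NONSTANDARD_DENIAL = re.compile(r"\bnot\s+approved\b", re.IGNORECASE)

def check_approval_phraseology(transcript):
    """Return a finding string for a non-conforming transmission, else None."""
    if NONSTANDARD_DENIAL.search(transcript):
        return "non-standard phraseology: use 'is denied', not 'not approved'"
    return None

# Example:
#   check_approval_phraseology("fire mission Alpha is not approved")
#   -> "non-standard phraseology: use 'is denied', not 'not approved'"
```

Because the check only needs surface patterns, it tolerates imperfect speech recognition better than approaches requiring full semantic understanding.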
Backup behaviors relating to the factors of trust and collective efficacy also play a significant role in combined arms team performance, especially in the avoidance of battlespace conflicts. The most important example arises when there is an incorrect clearance of a fire mission or air strike that violates safety constraints for a friendly unit. When conflicts arise, they can often be attributed to an error that a team member could have corrected. One simple example relates to the fact that artillery calculations are done in metric units, but altitude guidelines are conventionally reported to the relevant aircraft in feet. Other examples can often be traced to a change in the timing or location of the mission or friendly unit involved in the conflict, where such a change may not have been noted by a particular team member. In such cases, other team members monitoring the same communications networks should typically be aware of the correct or most recent information, at least in approximate terms. If the aircraft “stay above” altitude for an artillery mission is conveyed with a significant unit conversion error, this should be recognized by other team members. And fire missions approved with timings that have not been updated should also draw attention from other team members. Automated measures can be implemented with triggers that start from the detection of initial errors (e.g., an approval of a mission that leads to battlespace geometry conflicts), and then look for any backup behaviors among the team.

1.2 Combined Arms Example Scenario

This example walks through a vignette from a combined arms training event, where automated assessment mechanisms detect behavioral markers from the data streams, resulting in feedback produced in AAR. This was implemented as part of an operational demonstration prototype, using a training scenario based on those used for existing training. The scenario contains examples of both individual and team performance errors that occur in the course of a combined arms offensive operation, in which a Close Air Support (CAS) attack against red force tanks is planned as blue forces on the ground move to contact. Participants occupy roles in either the Fire Support Coordination Center (FSCC) or the distributed elements in communications with the FSCC. Key FSCC decision-makers are responsible for oversight of ground movements, indirect fire suppression missions, and the CAS mission against the target.

Early in the scenario, a tank platoon begins a movement toward the enemy target, following the convention of giving position reports every 500 meters as they approach their objective position. According to the plan, the tanks are to move to a position just outside the danger area for the planned CAS attack on the target and then halt and report position. Instead, the tank platoon leader makes the error of moving his unit past the halt position and into the danger area, and also fails to send position reports at either the planned position just outside of this area or the current halted position inside the danger area. This results in a conflict later, when the CAS mission proceeds while the tanks are inside the danger area at the time of CAS ordnance detonation.

In this sequence of events, the tank platoon leader’s error is compounded by a team error: the FSCC staff failed to anticipate the tank platoon’s position and request a position report when none had come. This is a reasonable and common form of backup behavior, as the maneuver units are expected to follow patterns established as part of a coordinated operation. Relevant data include simulation data such as tank platoon positions over time and CAS attack and detonation events; communications data such as tank platoon position reports (and the lack of elicited reports) and the CAS final approval; and C2 tool inputs such as those that trigger the CAS attack in the simulation.

The battlespace geometry conflict (tanks inside the active danger area from the CAS mission) is detected by assessment mechanisms in two forms. First, it is identified in a predictive manner when the CAS attack is approved and input with C2 tools. In an analysis of the attack as entered (and before it takes place in the simulation), the danger area associated with the attack ordnance and the current position of the tank platoon already constitute a predicted conflict. Later, when the CAS attack takes place, the conflict becomes realized as an actual event in the simulation. This in turn triggers analytics that look for markers relating to individual and team errors, especially from communications collected on the tactical networks. The implementation uses a combination of speech recognition methods, such as keyword spotting with a simple grammar, to characterize radio transmissions. In this case, the analytics need to identify the absence of certain communications.

From a team performance metrics standpoint, there is a communication deficiency in terms of information sharing when the position report is not proactively provided. There is also a deficiency in terms of collective efficacy and trust when no markers can be found for backup behavior from the FSCC staff, who failed to ask for the position report. Furthermore, within the FSCC itself, there are several staff roles beyond the individual who interacts with the tank platoon leader who also should have been aware of the situation and should have expected confirmation of ground unit positions before giving aircraft final clearance to drop ordnance. Any of them could have initiated a request for an updated position report.
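At its core, the predictive form of this conflict check is geometric: when a mission is entered in the C2 tools, compare the ordnance danger area against current friendly positions. The sketch below models the danger area as a circle around the intended impact point, an illustrative simplification; real danger areas depend on the ordnance and delivery profile.

```python
import math

def predicted_conflicts(impact_xy, danger_radius_m, friendly_positions):
    """Predictive battlespace geometry check, run when a CAS or fire
    mission is entered (before anything happens in the simulation).

    friendly_positions -- {"unit id": (x, y)} in the same grid, meters
    Returns the units currently inside the modeled danger area.
    """
    ix, iy = impact_xy
    return [unit for unit, (x, y) in friendly_positions.items()
            if math.hypot(x - ix, y - iy) <= danger_radius_m]

# Example:
#   predicted_conflicts((1000.0, 2000.0), 600.0,
#                       {"tank platoon": (1200.0, 2300.0)})
#   -> ["tank platoon"]   (about 361 m from the impact point)
```

A non-empty result at mission entry time is what seeds the later search for backup behaviors, such as whether anyone on the net flagged the conflict.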
Once concrete markers are identified for team performance factors in specific exercise events, the system must convey to the team and its individuals what problems occurred and how teamwork needs to be improved. In this domain, teamwork is primarily carried out through radio communications, so these transmissions are very important and therefore prominent in the debrief. Figure 1 below shows the automatically generated information for a debrief training point, with time-synchronized timelines for communications and simulation events.

Fig. 1. Training point timeline display

In this case, two communications timelines are relevant: position report communications on the CoA_TAC net, and air communications on the TAR net. These combine visual information about the transmissions and dialogs that took place, along with speech recognition results for their content. Simulation events are depicted in the bottom timeline, with domain-standard symbology for the CAS mission (FW1) and the tank platoon maneuver, and markings for conflicts. These timelines are provided as supplementary information for the debrief, to accompany text generated for situations like a conflict and any individual or team performance markers of note. Training points are automatically constructed with pre-configured 3D battlespace playback using logged simulation data, and the human instructors then select information to include from the timelines, such as key communications associated with behavioral markers (e.g., the last tank platoon position report, or the CAS final approval).
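A training point of this kind is essentially a time window over the collected streams, grouped into per-network communication lanes plus a simulation-event lane. The sketch below assembles such a structure from plain dictionaries; the field names and layout are assumptions for illustration, not the prototype’s actual data model.

```python
from collections import defaultdict

def build_training_point(title, transmissions, sim_events, t0, t1):
    """Assemble time-synchronized lanes for one debrief training point.

    transmissions -- dicts like {"time": 120.0, "net": "TAR", "text": "..."}
    sim_events    -- dicts like {"time": 300.0, "kind": "detonation"}
    t0, t1        -- bounds of the training point's time window (seconds)
    """
    lanes = defaultdict(list)
    for tx in transmissions:
        if t0 <= tx["time"] <= t1:
            lanes[tx["net"]].append((tx["time"], tx["text"]))
    sim_lane = [(e["time"], e["kind"])
                for e in sim_events if t0 <= e["time"] <= t1]
    return {"title": title,
            "comm_lanes": {net: sorted(v) for net, v in lanes.items()},
            "sim_lane": sorted(sim_lane)}
```

Rendering each lane against a shared time axis then yields a display like Figure 1, with instructors choosing which lane entries to surface in the debrief.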
Similar mappings are constructed for other combined arms related conditions to be identified and debriefed. While it is a detailed process to define the rules for these measures, and also to implement the mechanisms that analyze exercise data from multiple streams to identify behavioral markers for team performance, the benefit comes in offloading the burden on instructors. Particularly in large team training events, the operational tempo is such that instructors need to meet a rapid turnaround in preparing AAR debriefing materials, and intelligent tools facilitate this data-rich task.

2 Chat Analysis Tool for Large Team Training Exercises

Intelligent Diagnostic Assistant (IDA) is a tool to enhance AAR via visualization and analysis of inter-team communications during training exercises. Chat-based communications are becoming convenient and common channels for teams, and their role will become increasingly central in large military operations. In anticipation of this development, the US Air Force (USAF) was interested in investigating the role of chat-based communications in air operations exercises and their impact on performance assessment and feedback. IDA is primarily a tool to help team trainers visualize chat streams from multiple perspectives so they can perform an informed and detailed assessment of team performance; its analysis capabilities exist in support of visualization.

The research approach used in determining how to analyze and display information follows the operational planning methodology laid out in joint and USAF doctrine. The initiator for planning is normally a problem statement in the form of intelligence data or operational data reported to the team. The initiating report typically establishes a segregated planning approach to address the problem. The team then examines the problem in sequence with other planning tasks, or a sub-team may be tasked to examine the issue in parallel with other team activities. In many cases, planning may be interrupted and take on an interleaved character. When a training session ends, trainees need to be able to see each problem in isolation, as well as in context with other workload. The isolation approach allows the team to review actual process versus doctrine, while the context of workload offers insight into time delays, distractions, errant information sources, and overall cognitive effort. AAR tools must help an instructor sort and associate information with a unique process, and display information cogently to identify key areas that positively or negatively affected team and individual performance. Where chat logs are one of the primary sources of data indicating performance, tools for reviewing multiple chat logs in tandem become critical.

Based on a requirements analysis, we determined that classification of chat data according to missions (or processes) is an important capability for IDA. The objective of our research was to explore the extent to which this data can be analyzed to extract useful information without a deep semantic understanding of the messages. We focused on the use of statistical and rule-based techniques that analyze messages based on surface features such as word occurrences and correlations. The goal was to de-clutter the communication streams so that instructors can focus on meaningful threads to assess and discuss during AAR. IDA supports two primary activities: (i) association and filtering, and (ii) visualization and browsing.

2.1 Association and Filtering

The objective of the associative mapping is to identify messages on the same thread, where a thread is defined as a mission. IDA starts with an untagged set of chat messages sorted in chronological order and incrementally tags the messages with associated missions using a combination of heuristic and machine learning approaches. Multiple passes are made over the chat data to successively refine the associations; it is possible for a message to be associated with multiple missions. IDA performs two types of analyses to recognize associations. First, it performs keyword-based association over two passes through the data. An important feature of this domain is that each process or mission is associated with a specific identifier that is sometimes referenced in chat utterances. While only a small fraction of the chat messages include such mentions, these messages nonetheless form seed data from which other associations may be gleaned. Once this seed set has been identified, IDA uses supervised machine learning techniques to learn classifiers that map chat utterances to missions. A second pass through the data is then performed to classify the remaining messages using these classifiers. This still typically leaves a number of unclassified messages, which IDA classifies using temporal pattern-matching. First, a turn-by-turn interaction between two people in the same chat room within a time window (e.g., A says something to B, and 3 minutes later B says something to A) is associated with a common mission; under an assumption of dialog coherence, such conversation dyads are likely to refer to the same topic thread. Finally, the remaining messages are clustered according to the distribution of tags in the neighborhood of each message: a window around each message is analyzed, and the message is tagged with the mission most commonly tagged within that window.
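A compressed sketch of this multi-pass tagging is given below. The mission-identifier pattern, the message layout, and the choice of a TF-IDF plus Naive Bayes classifier with a confidence cutoff are all illustrative assumptions standing in for whatever the prototype actually used; the conversation-dyad pass is omitted for brevity.

```python
import re
from collections import Counter
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.naive_bayes import MultinomialNB

MISSION_ID = re.compile(r"\bMSN-\d{3}\b")  # hypothetical identifier format

def tag_messages(messages, window=5, min_confidence=0.8):
    """messages: chronological list of {"text": ...} dicts.
    Returns one mission tag (or None) per message."""
    # Pass 1: seed tags from explicit mission identifier mentions.
    tags = [m.group(0) if (m := MISSION_ID.search(msg["text"])) else None
            for msg in messages]

    # Pass 2: train a classifier on the seed set, apply where confident.
    seeds = [(msg["text"], t) for msg, t in zip(messages, tags) if t]
    if len({t for _, t in seeds}) >= 2:
        vec = TfidfVectorizer(ngram_range=(1, 2))
        clf = MultinomialNB().fit(vec.fit_transform([s for s, _ in seeds]),
                                  [t for _, t in seeds])
        for i, msg in enumerate(messages):
            if tags[i] is None:
                probs = clf.predict_proba(vec.transform([msg["text"]]))[0]
                if probs.max() >= min_confidence:
                    tags[i] = clf.classes_[probs.argmax()]

    # Final pass: neighborhood majority vote for whatever remains untagged.
    for i, t in enumerate(tags):
        if t is None:
            nearby = [x for x in tags[max(0, i - window):i + window + 1] if x]
            if nearby:
                tags[i] = Counter(nearby).most_common(1)[0][0]
    return tags
```

The key design point carries over from the paper: each pass only needs surface features and chronology, never a deep semantic reading of the messages.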
2.2 Visualization and Browsing

Even with a filtered set of chat data, it is still a time-consuming task to review synchronous conversation streams in multiple chat rooms and develop an understanding of the overall flow to identify performance indicators. This is the motivation for a tailored browsing capability that an instructor can use to review process-specific communications and visualize chronological relationships cross-referenced with exercise states. Typically, communications regarding a particular process will flow across multiple chat rooms, so synchronous browsing is a key feature. Additionally, the results of association and filtering can be reflected in the browsing environment as cues during the review process. For example, keywords detected by the supervised learning algorithm will often be of interest to an instructor and are highlighted while browsing.

Figure 2 shows the primary visualization view implemented with the prototype. IDA enables simultaneous, synchronous browsing of multiple chat rooms while preserving chronology, making it possible to follow communications across time and across chat rooms. With synchronous scrolling, the user can browse through the chat data exactly as it unfolded in the exercise. IDA has simple rules for automatically configuring the windows based on the phase of the exercise, the mission under consideration, and the density of chat traffic in the rooms.

Fig. 2. The IDA visualization tool.

The IDA chat visualizer provides information at three levels of detail. A timeline, shown along the right side of Figure 2, provides a bird’s-eye view of the distribution of communication in the selected chat rooms, without providing the details of communications. Its purpose is to establish the overall temporal context in a manner consistent with other tools utilized on adjacent screens during the AAR process. Individual channels are represented in the timeline, with independent markers to show the current temporal location of selected chat lines in each channel. The chat channels provide the next level of detail, each presented in a dedicated panel for a particular chat room. Each chat line for a channel is shown with the time stamp, the sender, and the first line of the message content as a scannable summary. Channels can be scrolled independently or synchronously, with chat lines aligned by time. A movable magnifying lens within the channel display provides the third level of detail: it shows the entire contents of a selected chat line in larger text, using multiple lines if necessary.

The combination of analysis, filtering, and visualization is designed to facilitate rapid assessment of team performance markers by instructors. The bird’s-eye view of the communications about a process/mission allows instructors to assess whether information is flowing through the right channels at the expected times. Filtering and synchronous browsing support the search for behavioral markers such as closed-loop communications. Instructors can also see the patterns and tempo of communications during different phases of a process to inform their AAR. Within each topic thread, instructors can observe how critical keywords (e.g., “approved”, “denied”) are propagated through different parts of the team.
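The synchronous scrolling behavior rests on a simple underpinning: per-room logs merged into one time-ordered stream that remembers which room each line came from. A minimal sketch, assuming each log is already sorted by timestamp:

```python
import heapq

def synchronized_stream(rooms):
    """Merge per-room chat logs into one chronological stream.

    rooms -- {"room name": [(timestamp, sender, text), ...]}, each list
             sorted by timestamp
    Returns (timestamp, room, sender, text) tuples in global time order,
    the basis for time-aligned scrolling across channel panels.
    """
    streams = [[(ts, room, sender, text) for ts, sender, text in log]
               for room, log in rooms.items()]
    return list(heapq.merge(*streams))
```

From such a merged stream, a viewer can keep every channel panel positioned at the same moment in exercise time as the user scrolls.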
3 Conclusions

The two systems discussed represent different points on the spectrum of intelligent automation in support of team training. Whereas IDA aimed to provide a tool to support instructor-derived assessments and feedback, the AAR tool for combined arms training exercises went a step further by automatically assessing mission events, correlating them with team communications, and tying the analysis to team performance behavioral markers. The users of these solutions were not seeking a fully automated tutor to imitate the capabilities of their instructors. What they desired were intelligent tools to serve as the “eyes and ears” of the instructors, amplifying their capacity to process the data from training exercises and construct tailored feedback. Incremental development of such intelligent training support tools is one promising path toward ultimately developing advanced intelligent tutoring capabilities.

4 Acknowledgment

Portions of this research were funded under a contract with the Air Force Research Laboratory, Wright-Patterson AFB. We are grateful for this support.

5 References

1. Baker, R. S.: Stupid tutoring systems, intelligent humans. International Journal of Artificial Intelligence in Education, 26(2), pp. 600–614 (2016). https://doi.org/10.1007/s40593-016-0105-0
2. Johnston, J. H., Smith-Jentsch, K. A., Cannon-Bowers, J. A.: Performance measurement tools for enhancing team decision-making training. In: Brannick, M. T., Salas, E., Prince, C. (eds.) Team performance assessment and measurement: Theory, methods, and applications, pp. 311–327. Erlbaum, Mahwah, NJ (1997).
3. Salas, E., Rosen, M. A., Burke, C. S., Goodwin, G. F.: The wisdom of collectives in organizations: An update of the teamwork competencies. In: Team effectiveness in complex organizations: Cross-disciplinary perspectives and approaches, pp. 39–79 (2009).
4. Salas, E., Shuffler, M. L., Thayer, A. L., Bedwell, W. L., Lazzara, E. H.: Understanding and improving teamwork in organizations: a scientifically based practical guide. Human Resource Management, 54(4), pp. 599–622 (2015). https://doi.org/10.1002/hrm.21628
5. Sottilare, R. A., Shawn Burke, C., Salas, E., Sinatra, A. M., Johnston, J. H., Gilbert, S. B.: Designing adaptive instruction for teams: a meta-analysis. International Journal of Artificial Intelligence in Education, 28(2), pp. 225–264 (2018). https://doi.org/10.1007/s40593-017-0146-z
6. Gilbert, S. B., Slavina, A., Dorneich, M. C., Sinatra, A. M., Bonner, D., Johnston, J., Holub, J., MacAllister, A., Winer, E.: Creating a Team Tutor Using GIFT. International Journal of Artificial Intelligence in Education, 28(2), pp. 286–313 (2018). https://doi.org/10.1007/s40593-017-0151-2