Process Mining in Information Technology Incident Management: A Case Study at Volvo Belgium

Arjel D. Bautista, Syed M. Kumail Akbar, Anthony Alvarez, Tom Metzger, Marshall Louis Reaves

CKM Advisors, 711 Third Avenue Suite 1806, New York, NY, USA
{abautista, sakbar, tmetzger, aalvarez, mlreaves}@ckmadvisors.com

Abstract. The goal of this study is to identify opportunities to improve the operational performance of information technology incident management at Volvo IT Belgium. Findings are derived exclusively from computational analysis of incident and problem event logs (totaling 74,544 events) from May-June 2012, provided as part of the 2013 Business Process Intelligence Challenge. Improvements that increase resource efficiency, reduce incident resolution times, and thereby lessen customer impact were identified across the following areas: service level push-to-front, ping pong between support teams, and Wait-User status abuse. The specific products, support teams, organizational structures, and process elements most appropriate for further study are identified, and specific analyses are recommended. We conclude that operational improvement areas can be elucidated exclusively from obfuscated event logs.

1 Introduction

In recent years, incident management has attracted growing attention from process mining practitioners seeking to identify efficiency opportunities within complex business functions. Several studies have already demonstrated the value of process mining within incident management for improving compliance and managing risk [1,2]. Our aim is to investigate incident management with the specific objective of improving operational performance and increasing productivity. The 2013 Business Process Intelligence Challenge (BPIC 2013) is one such opportunity to uncover sources of performance improvement in incident management by analyzing a set of real-world data.
1.1 Approach and Scope

The BPIC 2013 focuses on the incident and problem management procedures of Volvo IT Belgium, from which a body of data has been collected. In our analysis of this information, we sought to understand the Volvo IT service protocols in detail and at varying levels of granularity. In doing so, we combined process mining and computational tools with traditional spreadsheet modeling techniques to generate meaningful insights from the provided data sources.

2 Materials and Methods

2.1 Description of the Data

The event log consists of three sections obtained from Volvo IT Belgium. VINST cases incidents concerns the organization's incident management segment, while VINST cases open problems and VINST cases closed problems contain data for the problem management system. Problems are defined as incidents carrying a "major" impact at any point in the resolution process, or incidents that could recur in the future (as judged by action owners) [3]. All three data sets contain information for cases resolved in May 2012 (with a limited number of exceptions). Each data set contains analogous fields that reveal key information about the steps performed throughout the lifetime of a case [3].

Event Log                      # Events   # Distinct Cases
VINST cases incidents            65,533              7,554
VINST cases open problems         2,351                819
VINST cases closed problems       6,660              1,487
Grand Total                      74,544              9,860

Table 1: Quantification of events and distinct cases in each of the three data sets.

The bulk of our efforts were spent on analyzing the incident data set. We chose to prioritize incidents because they represent the majority of all cases; this enabled us to segment the data further and arrive at more pointed analysis and recommendations.
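The per-log tallies in Table 1 can be reproduced with a short script. The sketch below assumes each log has been exported to a delimited text file with one row per event and a "Case ID" column; the file layout and column name are our assumptions, not part of the BPIC 2013 specification.

```python
# Minimal sketch: count events and distinct cases in one exported event log.
# Assumes a semicolon-delimited export with a "Case ID" column per event.
import pandas as pd

def summarize_log(path):
    """Return (number of events, number of distinct cases) for one log file."""
    log = pd.read_csv(path, sep=";")
    return len(log), log["Case ID"].nunique()
```

Applied to the three exports, this yields the event and case counts that make up the grand totals of 74,544 events and 9,860 cases.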
Furthermore, problems typically require more tailored responses than incidents, inhibiting our ability to draw meaningful conclusions that could become broadly applicable business recommendations without access to more data about the nature of the problems being resolved. Therefore, we focus our analysis on incidents.

2.2 Software Used for Analysis

We procured the version of Disco made available for the purpose of this competition (Version 1.3.6; Fluxicon, Eindhoven, The Netherlands) and loaded a project set created specifically for the BPIC 2013 from the original XES / MXML files [4]. We used this tool to classify cases according to path and sequence qualities that are difficult to represent in tabular form.

We used Microsoft Excel (Microsoft Office 2010; Microsoft Corporation, Redmond, WA, USA) to process the raw event logs and to explore the processed data. Excel was especially helpful for performing basic and intermediate mathematical functions. We leveraged R (version 3.0.1) with the RStudio (version 0.97.449) environment for its statistical and graphical capabilities. We found both built-in and user-defined functions invaluable for preparing, analyzing, and visualizing data.

3 Data Preprocessing

3.1 Making Sense of the Raw Event Log

The BPIC 2013 data set required preprocessing prior to its use in analysis and the generation of meaningful business insights. This data set also posed unique problems due to its level of abstraction. Below we describe some of the cleanup and processing steps we performed and the assumptions made during our analysis of the data.

Unique Mapping of Action Owners

The only name field given, Owner First Name (1,440 unique values in the incidents log, 240 and 585 in the open and closed problem logs, respectively), does not map uniquely to the Owner Country field. We surmised that some names might be used by multiple people in different countries.
We concatenated the owner countries with first names to create a new field, Concatenated Country / Name, with 1,688 distinct entities for the incidents log, and 254 and 631 for the open and closed problems logs, respectively. We could not account for the possibility that multiple distinct people within the same country use the same name; this was not possible without additional information such as employee identification numbers.

Calculation of Step and Case Duration

Each event is associated with a single time stamp (the instant at which a status change occurs), so we determined elapsed time by calculating the difference between status changes. Under this convention, the final status in each case (usually Completed) is considered to conclude instantaneously.

Separation of Sub-statuses by Resource Input

We segmented portions of case duration associated with productive time and unproductive time (not requiring input from human resources) by status and sub-status for analysis of operational performance and productivity. We considered the sub-status In Progress to be productive time spent working on the case. We considered Queued–Awaiting Assignment and Accepted–Assigned to be unproductive case time with no active involvement by IT resources. The various Wait statuses (e.g. Wait or Wait–Implementation) had insufficient supporting data to determine whether they represented unproductive time or time when associated organizations were providing assistance to Volvo IT. We included Wait statuses in our calculations of total time, but in our analysis they were treated as "Other" time, neither productive nor unproductive. The status Completed–Resolved covers time after a solution has been delivered and is thus neither productive nor unproductive time for Volvo IT.
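The duration convention above can be sketched in a few lines: each event carries a single timestamp, so a step's duration is the gap to the next status change in the same case, and the final event of a case concludes instantaneously. The tuple layout is an illustrative assumption.

```python
# Sketch of the step-duration convention: duration of an event is the time
# until the next event of the same case; the last event gets 0.0 hours.
from datetime import datetime

def step_durations(events):
    """events: list of (case_id, timestamp) tuples, one per status change.
    Returns durations in hours, aligned with the case/time-sorted events."""
    # Sort by case, then time, so consecutive rows of a case are adjacent.
    ordered = sorted(events, key=lambda e: (e[0], e[1]))
    durations = []
    for this, nxt in zip(ordered, ordered[1:] + [None]):
        if nxt is not None and nxt[0] == this[0]:
            durations.append((nxt[1] - this[1]).total_seconds() / 3600.0)
        else:
            durations.append(0.0)  # final event of the case: instantaneous
    return durations
```

Summing these per-step durations by case gives total case duration; summing by sub-status gives the productive/unproductive split described above.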
Product Groupings

Ideally, we would consolidate products into broader categories wherever possible to make the data set more manageable. However, this is not possible due to the lack of identifying information about products, such as functionality and design. This also prevents us from drawing conclusions based on the nature of the work being performed.

Linking Problems to Incidents

One of the most interesting analyses we would like to conduct is determining the causal factors behind the elevation of incidents to problems. However, this type of analysis requires a larger incident data set that encompasses cases closed before the month of May.

Extraction of Service Line

The service line information is embedded within the Support Team (ST) designations themselves, as most values in the Involved ST field (for example, N52 2nd) contain both a support team number and a service line designator. We extracted these values and assumed that entries without an explicit service line designator belong to Service Line 1, the common name for the Service and Expert Help Desks. While most support teams are confined to handling events within a single service line, some STs do span several lines, particularly within the incident management organization (Table 2):

Service Line   # of Support Teams   # of Support Teams   # of Support Teams
Involvement    (Incidents)          (Open Problems)      (Closed Problems)
1 Only         201                  23                   45
2 Only         255                  117                  186
3 Only         91                   47                   88
1 and 2        34                   0                    0
2 and 2.5      1                    0                    0
2 and 3        16                   0                    2
Grand Total    598                  187                  321

Table 2: Service line involvement for support teams in each of the three data sets.

Hierarchy Assumptions

Our understanding of the organizational hierarchy stems from the description provided [3] and our analysis of the data set. We suggest the following structure for mapping support teams to their respective Organization or Function: Organization or Function → Support Team → Resource.
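The service-line extraction described above can be sketched as a suffix lookup on the Involved ST value. The exact set of designator spellings in the log (and the "2nd 2.5" form for Service Line 2.5) is our assumption; only the default-to-Line-1 rule is stated in the text.

```python
# Sketch: pull the service line out of an Involved ST value such as "N52 2nd".
# Designator spellings are assumptions; missing designator defaults to Line 1.
_DESIGNATORS = {"2nd 2.5": "2.5", "2nd": "2", "3rd": "3"}

def service_line(involved_st):
    """Return the service line ("1", "2", "2.5", or "3") for an ST value."""
    for suffix, line in _DESIGNATORS.items():
        if involved_st.endswith(" " + suffix):
            return line
    return "1"  # no explicit designator: Service / Expert Help Desk
```

For example, "N52 2nd" maps to Service Line 2, while a bare team name such as "G97" maps to Service Line 1.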
This reflects the given definition of Organization as the business area of the user reporting the problem and Function as the grouping of IT divisions. Our analysis also supports this understanding, given that Organizations and Functions do not map one-to-one. Furthermore, Support Teams do not map one-to-one with Organization or Function, and Resources do not map one-to-one with Support Teams.

4 Results

4.1 Process Conformance between Organizational Lines

We took a process-centric approach to evaluating conformance between organizational lines. This approach required a description of a standard process, yet descriptions of the standard process flows for Volvo IT incident and problem management were not included in the documentation. Through our initial analysis we discovered a standard process for both incident and problem management. This enabled us to evaluate the conformance of Organization Lines (Org Lines) A2 and C to the standard process and to each other.

Figure 1: Highly simplified process maps of incidents and closed problems. Left: Disco-generated process map of all incidents; right: Disco-generated process map of all closed problems. Threshold settings, both process maps: Activities 20%, Paths 40%.

Determining Standard Case Flow

We discovered the standard incident and problem management processes by leveraging Disco's built-in process map generator with the activities and paths thresholds set to 20% and 40%, respectively. This gave us a highly simplified depiction of the path of a typical incident or problem. We examined only closed problems, as these depict the process from start to finish. The simplified incident process map shows two standard ways to close a case: via Completed–Resolved → Completed–Closed, or simply through Completed–In Call (Figure 1).
The simplified closed problems process map demonstrates that only the Completed–Closed route is used with any frequency, which is to be expected, as problems are major or recurring incidents that cannot be resolved in a single call.

Process Conformity with Respect to Incidents

We tested the conformance of Org Lines A2 and C more rigorously by broadening the scope to encompass more variation. We identified the 8 most important steps and simplified the process map considerably, while still maintaining 99% case coverage, by setting the activities and paths thresholds to 55% and 35%, respectively. To ensure that any differences in the process were due to differences between Org Lines A2 and C, we examined cases in which only one Org Line, A2 or C, was involved. Since process maps generated by Disco can be difficult to compare visually, we chose to represent them as adjacency matrices of case frequency. The adjacency matrix denotes the number of cases in which the event in the column followed the event in the row at least once.

Figure 2: Org Line C handles the vast majority of cases terminating in Completed–In Call. First, Org Line C's adjacency matrix; second, the adjacency matrix for all incidents. Each number denotes the number of cases in which the sub-status in the column followed, at any point, the sub-status in the row.

When we compared these adjacency matrices to the simplified process map, we noticed that of the 1,882 cases concluding with Completed–In Call, 1,800 (95.6%) involve Org Line C alone (Figure 2). The Completed–In Call designation is used whenever a service request is completed during a call to the help desk (Service Line 1). This suggests that Org Line C is the primary line responsible for help desk cases, as it resolves roughly 96% of all cases that finish via Completed–In Call. In contrast, Org Line A2, and all the other Org Lines, handle very few help desk cases.
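The adjacency-matrix representation described above can be sketched directly: cell (a, b) counts the cases in which sub-status b follows sub-status a at least once, so repeated transitions within one case are counted only once. The input layout is an illustrative assumption.

```python
# Sketch: build a case-frequency adjacency matrix from per-case step lists.
# Each (row, col) cell counts cases where col follows row at least once.
from collections import defaultdict

def case_adjacency(cases):
    """cases: dict mapping case id -> ordered list of sub-statuses.
    Returns a dict mapping (row, col) -> number of cases with that transition."""
    matrix = defaultdict(int)
    for steps in cases.values():
        seen = set(zip(steps, steps[1:]))  # each transition counted once per case
        for pair in seen:
            matrix[pair] += 1
    return dict(matrix)
```

A case that ping-pongs A → B → A → B therefore contributes 1, not 2, to the (A, B) cell, matching the "at least once in the case" definition.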
Figure 3: Org Lines C and A2 use most sub-statuses with roughly the same frequency. This matrix shows the ratio of the normalized case frequencies. Ratios greater than one denote steps more frequently used by Org Line C, and ratios less than one denote steps more frequently used by Org Line A2. Asterisks denote transitions traversed by less than 5% of cases for both Org Lines A2 and C.

To further assess the difference between Org Lines A2 and C, we eliminated cases ending in Completed–In Call. We generated process maps with the same threshold levels as before (Activities 55%, Paths 35%). Since Org Line C handles many more cases than Org Line A2, we normalized the number of transitions by the total number of cases. We compared the two Org Lines by taking the ratio of the normalized number of transitions (Figure 3). With few exceptions the ratios are near 1, indicating that there is little other deviation between Org Lines A2 and C. We noted that Org Line C uses the sub-status Queued–Awaiting Assignment roughly 30% more frequently than Org Line A2 and the sub-status Accepted–Assigned nearly twice as often (Figure 3). We also note that both Org Lines A2 and C conform to the standard process of Accepted–In Progress → Completed–Resolved → Completed–Closed.

Process Conformity with Respect to Closed Problems

There are far fewer closed problems than incidents, so we were able to examine the process maps at a much greater level of detail. For these process maps, we set the activities and paths thresholds at 100% and 90%, respectively. We used the same normalization and ratio analysis as for incidents, excluding the cases ending in Completed–In Call.

Figure 4: Org Line C has twice the proportion of problems with sub-status Queued–Awaiting Assignment. This matrix shows the ratio of the normalized case frequencies.
Ratios greater than one denote steps more frequently used by Org Line C, and ratios less than one denote steps more frequently used by Org Line A2. Asterisks denote transitions traversed by less than 5% of cases for both Org Lines A2 and C.

Discussion of Process Conformance Analysis

We established standard process flows for both incidents and problems. We demonstrated that the primary difference between Org Lines A2 and C is that Org Line C handles the vast majority of Completed–In Call incident cases. Excluding these cases, the processes are roughly equivalent, with few exceptions. When handling incidents, Org Line C uses Accepted–Assigned at a rate 80% higher than Org Line A2. When handling problems, a case handled by Org Line C is twice as likely to use the sub-status Queued–Awaiting Assignment, while a case handled by Org Line A2 is twice as likely to use the sub-status Accepted–Assigned (Figure 4). Beyond these differences, Org Lines A2 and C appear to follow the standard incident and problem management processes.

4.2 Push-to-Front

Our Understanding of Push-to-Front

Push-to-front (PTF) behavior is defined as an incident reaching resolution through first-line personnel (hereafter referred to as Service Line 1) without involvement from higher-line support teams (Service Lines 2 and 3). Push-to-front resolution is preferred in modern IT incident management because it minimizes interruption of the duties normally performed by Service Lines 2 and 3, which typically do not include product support. We analyzed the push-to-front issue by segmenting cases by initial org line, function, product, and country of origin. This strategy allowed us to identify org lines that handle primarily PTF cases (e.g. Org Line C) and elevated start cases (e.g. Org Line A2), as well as to recognize that a majority of functions are centered on elevated start incidents.
We also identified products that are particularly prone to push-to-front resolution, and those that might benefit from reassignment to other case types.

Figure 5: Distribution of all 7,554 incidents into one of six resolution types. Incidents are classified according to org line composition and push-to-front behavior.

With this in mind, we classified each completed incident into one of six resolution types, as depicted in Figure 5.

Push-to-Front Behavior by Initial Org Line

We segmented the completed incidents by their initial org lines in order to more fully understand the role of these lines in handling cases of a particular resolution type (Table 3). The types of cases assigned to org lines vary significantly. With respect to Org Lines C and A2 (the lines to which ~86% of cases are initially assigned), the former primarily handles single org line, push-to-front cases (3,777 total incidents, 65.7% of line total), whereas the latter focuses mainly on single org line cases originating at Service Lines 2 or 3 ("elevated start" cases; 362 total incidents, 48.7% of line total). Org Line B appears to exhibit "C-like" behavior in that 168 of its 290 initially assigned cases (57.9%) are eventually resolved in a single org line, push-to-front fashion. One interesting observation is that these B-assigned incidents originate primarily from countries outside the European Union and North American regions, namely Australia, Brazil, China, India, Malaysia and Russia. Perhaps this line serves as an auxiliary support unit equipped to handle routine PTF incidents, so as not to overwhelm other lines handling more complicated calls from higher-volume regions.
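The six-way classification can be sketched as follows, under our reading of the categories: "push-to-front" means every event stayed at Service Line 1, "escalated" means the case started at Line 1 but reached Lines 2 or 3, and "elevated start" means the case began at Lines 2 or 3; each is crossed with whether one or several org lines were involved. The function signature is an illustrative assumption.

```python
# Sketch: assign one of the six resolution types to a completed incident.
def resolution_type(service_lines, org_lines):
    """service_lines: per-event service line ("1", "2", "2.5", "3");
    org_lines: per-event org line. Both in event order for one case."""
    if all(line == "1" for line in service_lines):
        kind = "push-to-front"          # resolved entirely at Service Line 1
    elif service_lines[0] == "1":
        kind = "escalated"              # started at Line 1, reached Lines 2/3
    else:
        kind = "elevated start"         # entered the process at Lines 2/3
    scope = "single org line" if len(set(org_lines)) == 1 else "multiple org lines"
    return f"{kind}, {scope}"
```

For instance, a case handled entirely by Org Line C at Service Line 1 classifies as "push-to-front, single org line", the dominant category for Org Line C.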
First     # of        Push-to-Front,  Push-to-Front,  Escalated,  Escalated,  Elevated Start,  Elevated Start,
Org Line  Completed   Single          Multiple        Single      Multiple    Single           Multiple
          Cases       Org Line        Org Lines       Org Line    Org Lines   Org Line         Org Lines
C         5,746       66%             3%              9%          20%         2%               0%
A2        744         16%             7%              10%         10%         49%              9%
Other     419         27%             17%             0%          55%         0%               1%
B         290         58%             3%              11%         17%         11%              0%
G4        157         0%              0%              0%          0%          73%              27%
V2        69          45%             0%              41%         7%          7%               0%
G2        37          14%             22%             0%          65%         0%               0%
V5        21          0%              0%              0%          14%         33%              52%
G1        18          0%              0%              0%          0%          50%              50%
F         14          0%              0%              0%          36%         7%               57%
V11       13          0%              23%             0%          54%         0%               23%
H         7           71%             0%              0%          14%         0%               14%
Misc.     11          0%              0%              0%          27%         0%               73%
Total     7,546       56%             4%              8%          20%         9%               2%

Table 3: Distribution of completed incidents by first Org Line and resolution type. Leading resolution types for each Org Line are highlighted in bold in the original table.

While most escalated cases are handled as part of the portfolio of incidents managed by Org Line C (1,129 total incidents, 19.6% of line total), a number of org lines also appear to specialize in handling these incidents, notably G2, V2 and the undesignated org line "Other", which itself could comprise instances of org lines listed elsewhere.

Push-to-Front Behavior by Support Team Function / Division (ST Function Div)

In a manner similar to that used for the org line analysis, we segmented all 7,546 completed incidents by the identity of their initial functions (Table 4). In contrast to the varied specializations exhibited by the different org lines, most of the functions (17 of 21) appear to specialize in single org line, elevated start incidents. Only two, E_5 and V3_2, have strong specialization tendencies toward push-to-front cases, while a third, A2_1, has a fairly even distribution between these types of incidents and escalated (specifically multiple org line) cases.
First ST      # of        Push-to-Front,  Push-to-Front,  Escalated,  Escalated,  Elevated Start,  Elevated Start,
Function Div  Completed   Single          Multiple        Single      Multiple    Single           Multiple
              Cases       Org Line        Org Lines       Org Line    Org Lines   Org Line         Org Lines
V3_2          4,802       70.8%           2.7%            9.2%        17.1%       0.0%             0.1%
A2_1          986         38.6%           11.9%           9.4%        33.4%       5.1%             1.6%
_             836         19.7%           9.0%            3.5%        30.5%       27.2%            10.2%
E_5           421         60.8%           0.0%            12.1%       25.2%       1.9%             0.0%
A2_4          159         0.0%            0.0%            8.2%        1.3%        78.0%            12.6%
D_1           89          0.0%            0.0%            0.0%        5.6%        91.0%            3.4%
A2_2          73          1.4%            0.0%            0.0%        0.0%        74.0%            24.7%
E_6           50          4.0%            0.0%            0.0%        10.0%       46.0%            40.0%
A2_3          48          0.0%            0.0%            0.0%        0.0%        87.5%            12.5%
A2_5          25          40.0%           4.0%            0.0%        0.0%        44.0%            12.0%
E_10          21          4.8%            0.0%            0.0%        4.8%        76.2%            14.3%
C_6           13          0.0%            0.0%            0.0%        0.0%        100.0%           0.0%
E_1           5           0.0%            0.0%            20.0%       0.0%        60.0%            20.0%
E_8           5           0.0%            0.0%            0.0%       0.0%        80.0%            20.0%
Misc.         13          0.3%            0.0%            0.0%        0.5%        92.3%            7.7%
Total         7,546       55.9%           4.3%            8.3%        20.2%       8.9%             2.4%

Table 4: Distribution of completed incidents by first ST Function Div and resolution type. Leading resolution types for each function are highlighted in bold in the original table. Misc. includes all other STs not otherwise included in the table.

Product   # of Completed Cases
566       158
832       39
369       30
505       20
420       19
522       15
732       15
533       14
794       14
53        13

Table 5: Top 10 products exhibiting 100% PTF behavior.

Product   # of Push-to-Front Cases   % of Completed Cases Exhibiting PTF Behavior
424       684                        77.6%
660       442                        91.3%
383       193                        94.1%
253       172                        76.4%
566       158                        100.0%
494       142                        76.3%
13        107                        81.7%
321       94                         87.0%
267       79                         66.4%
453       77                         83.7%

Table 6: Products with the highest number of push-to-front incidents.

Push-to-Front Behavior by Product

Since many products are represented by only a single incident, we sought to simplify our analysis by setting a minimum case threshold while still representing most incidents, resulting in 226 products comprising 6,724 of the original 7,546 completed cases (89.1% coverage).
Under this threshold, a number of products exhibit strong PTF behavior, both by the proportion and by the absolute number of PTF cases handled (Tables 5 and 6). At the opposite end of the spectrum, a number of products begin their lives at Service Line 1 but are eventually escalated to the higher service lines prior to completion. For some of these particularly high-volume products (Table 7), a re-designation as "elevated start" may prove beneficial in terms of time saved and a decrease in overall complexity for these cases.

Product   # of Completed Cases
542       75
604       36
295       32
337       28
54        27
818       27
308       20
488       18
631       18
591       17

Table 7: Top 10 products exhibiting 0% PTF behavior.

As highlighted in the VINST user manual, "recording a solution also makes it possible for you to resolve similar SR's without doing extensive research" [5]. However, "Solutions are objects in the database that are separate from Service Requests" and require users to explicitly associate solutions with service requests. To promote push-to-front behavior and decrease total work, Volvo IT could require or incentivize the addition and association of solutions to all incidents.

Figure 6: Correlation between push-to-front behavior and number of completed incidents, grouped by product. Products are first segmented into bins by number of cases (5-50, 51-100, 101-200, and greater than 200) and then divided into quartiles by push-to-front frequency (0-25%, 25-50%, 50-75%, and 75-100% of cases exhibiting push-to-front). The percentage of products per quartile and the number of products per quartile are indicated above each bar.

Finally, we focused on the PTF behavior of individual products and learned that some are certainly more prone to PTF resolution (Figure 6) than others (Table 7).
Additionally, we evaluated the behavior of incidents that are not push-to-front in nature (that is, those involving some escalation to Service Lines 2 or 3 during their lifetime) to see whether any opportunities exist for streamlining the handling of these cases. For example:

- Are there specific products for which a large percentage of reported cases begin at Service Line 1, but are eventually escalated to Service Lines 2 or 3?
- Are there cases that spend a very short time at Service Line 1 before being escalated?

To answer these questions, we categorized the products according to the percentage of cases that are eventually escalated from Service Line 1, and the average elapsed time (in minutes) from the first recorded event to the time of escalation. We then filtered out products containing fewer than five total incidents (see above) and isolated those with ≥75% escalated cases (out of a total of 10 or more completed incidents) and <20 minutes spent at Service Line 1 prior to first escalation.

Product   # of Completed   % of Cases Escalated     Average Time (min) at SL1   Std. Dev. of Time at SL1
          Cases            to Service Lines 2 or 3  Prior to First Escalation   Prior to First Escalation
488       18               100.0%                   1.3                         5.6×10^-4
238       11               81.8%                    4.0                         2.5×10^-3
431       10               90.0%                    4.2                         1.6×10^-3
726       14               100.0%                   10.2                        5.5×10^-3
655       10               90.0%                    11.1                        9.6×10^-3
542       75               98.7%                    11.5                        1.6×10^-2
305       29               93.1%                    17.6                        1.7×10^-2

Table 8: Potential candidates for re-designation as elevated start products.

To this end, we identified seven products (Table 8) that might benefit from re-designation to higher service lines upon submission to the incident management system, bypassing the initial handling at Service Line 1 and possibly streamlining their resolution as "elevated start" products.
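The Table 8 filter can be sketched as a simple predicate over per-product summary statistics. The dictionary keys are illustrative assumptions, not field names from the data set.

```python
# Sketch of the Table 8 filter: products with at least 10 completed
# incidents, >= 75% of cases escalated, and under 20 minutes spent at
# Service Line 1 before first escalation.
def redesignation_candidates(products):
    """products: iterable of dicts with keys 'product', 'completed',
    'pct_escalated' (0..1), and 'avg_minutes_at_line1'."""
    return [
        p["product"]
        for p in products
        if p["completed"] >= 10
        and p["pct_escalated"] >= 0.75
        and p["avg_minutes_at_line1"] < 20
    ]
```

Applied to the full per-product summary, this predicate selects the seven elevated-start candidates listed in Table 8.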
Push-to-Front Behavior by Country of Origin

Finally, we examined the push-to-front behavior of cases by country of origin, in order to evaluate whether specific countries may be responsible for a disproportionate number of cases of a certain resolution type. As shown in Figure 7 below, three countries, Poland, USA and Brazil, are strong originators (≥65% of cases belonging to a single resolution category) of push-to-front incidents, while a fourth, Canada, produces almost exclusively elevated start cases. While >98% of incidents from the Netherlands are escalations, all but two of these belong to a single product, 542, which suggests that this behavior is due to the product itself rather than the practices followed by the country's reporting staff. However, in Canada, a location from which a much larger number of products (27 in total) is reported, 65 of 66 cases (>98%) are elevated start cases, suggesting the opposite. As the ambiguous nature of the source data makes it difficult to form hypotheses about the products themselves, additional information is necessary in order to draw definitive conclusions about the behavior of countries toward the incidents they handle.

Country of Origin   # of Completed Cases   Push-to-Front   Escalated   Elevated Start
Poland              1,762                  81%             17%         -
USA                 779                    73%             22%         5%
Brazil              311                    67%             29%         -
Sweden              2,954                  60%             24%         16%
Russia              45                     51%             49%         -
Netherlands         58                     -               98%         -
India               490                    21%             61%         18%
China               100                    41%             55%         -
Belgium             482                    45%             48%         6%
France              306                    35%             48%         17%
South Korea         65                     34%             46%         20%
Canada              66                     -               -           99%

Figure 7: Push-to-front behavior of completed cases by country of origin. Not included in this table are the low-volume countries Malaysia (26 cases), Australia (25), UK (22), Japan (15), Thailand (6), Chile (3), Peru (2) and Turkey (2). Also not included: 26 cases with no country of origin attached.
Total coverage, excluding these entities: 98.3%.

Conclusions and Potential Opportunities for Improvement

Our analysis identified the specialization tendencies of the org lines initially assigned to cases in the incident management system, which proved to be widely distributed among push-to-front, escalated, and elevated start cases. Most importantly, we identified a distinct difference between the cases initially assigned to Org Line A2 (mainly elevated start) and Org Line C (push-to-front), the two lines to which ~86% of cases are assigned upon submission to VINST. We also learned that while a majority of the functions (17 of 21) specialize in elevated start cases, only three of the remaining four tend to handle push-to-front incidents. With respect to countries of origin, we learned that the nature of the cases handled varies widely across the three resolution types, but given the state of the source data, we were unable to conclude whether these observations were due to the product and problem profiles of the reporting locations or to the reporting tendencies of the countries themselves. A more detailed description of the resolution process would allow us to measure the process conformity of cases belonging to similar or identical products and to evaluate the overall process by which all incidents are handled.

4.3 Ping Pong Behavior

Our Understanding of Ping Pong Behavior

Many cases are handled with the involvement of a single support team, while others require the involvement of additional support teams to reach a satisfactory resolution. We analyzed the occurrence of ping pongs between support teams, which we defined as any instance in which a support team works on a specific case more than once following a transfer between support teams. This definition accounts for both direct ping pongs (A → B → A) and indirect ping pong cycles (A → B → C → A).
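The definition above can be sketched as a scan over a case's support-team sequence: a ping pong is counted each time a team reappears after the case has been transferred away, which covers both direct and indirect cycles. The input layout is an illustrative assumption.

```python
# Sketch: count ping pongs in one case, per the definition above. A ping
# pong occurs whenever a team returns to a case after a transfer away,
# covering direct (A -> B -> A) and indirect (A -> B -> C -> A) cycles.
def count_ping_pongs(team_sequence):
    """team_sequence: support teams in event order for one case; consecutive
    duplicates are allowed (a team may log several events in a row)."""
    ping_pongs = 0
    seen = set()
    previous = None
    for team in team_sequence:
        if team != previous:          # a transfer (or the first assignment)
            if team in seen:          # the team worked on this case before
                ping_pongs += 1
            seen.add(team)
            previous = team
    return ping_pongs
```

Under this sketch, A → B → A and A → B → C → A each count one ping pong, while A → B counts none.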
The documentation for BPIC 2013 describes ping pongs as an "unwanted situation", so we leveraged the provided data to assess the business impact of ping pongs [3]. We first evaluated the impact of ping pongs on case completion time, and then determined which support teams are most responsible for ping pongs. We identified the products with the highest ping pong rates and the impact that a targeted initiative aimed at reducing ping pongs could have on the work time of support teams at Volvo IT. We recognize that certain processes may require support teams to transfer a case back and forth for various legitimate reasons; however, from this data we are unable to distinguish definitively between legitimate ping pongs that follow process design and illegitimate ping pongs. Nonetheless, we can point to concentrations of activity.

The Impact of Ping Pongs on Completion Time

To determine the effect of ping pongs on incident duration, we compared the mean durations of incidents with ping pongs to those without. To ensure that our calculations were independent of variation stemming from product differences, we analyzed the incidents (5,893 incidents, 78% of total) concerning the 205 products that had incidents both with and without ping pongs. These results were split into deciles as shown in Figure 8. Cases exhibiting ping pong were on average 2.3-fold longer than those without ping pongs, holding product constant and excluding the top and bottom deciles (Figure 8). Incident-weighted mean case durations were 201.0 and 465.6 hours for cases without and with ping pongs, respectively. Both the first and tenth deciles warrant further analysis, as ping pong cases in the top decile are associated with case durations orders of magnitude above the other nine deciles.

Figure 8: The mean duration of incidents with ping pongs is longer than that of incidents without ping pongs for c. 80% of products.
Product deciles by fold-change in mean case duration when cases exhibit at least one ping pong (n=205). The top decile is plotted on the right-side axis. The dotted line indicates equivalency of mean case duration with and without ping pong.

We also determined the average portion of time incidents spend in steps of each sub-status. We considered In Progress to be the status that identifies the actual work effort by the Volvo IT support teams, and Queued and Assigned as indications of unproductive process time that increase the total resolution time. The distribution of time spent in steps of these sub-statuses varies across deciles.

Figure 9: Ping pongs increase the average case time roughly 2-fold while increasing work time and unproductive wait time roughly 5-fold. Source of change analysis between average incidents without ping pongs and average incidents with a ping pong. Average times are in hours; the additional times are the difference between the average time for each respective status in the average incident without a ping pong and the average time for that status designation in an average ping pong case. Other includes all status and sub-status designations not specifically listed.

The increase in the time spent in these steps has two separate but related impacts: 1) increased In Progress time adds work for support teams, and 2) increased Queued, Assigned and Other time increases the total incident time, affecting customers and potentially SLAs. The total impact of the additional 40.3 In Progress hours for 952 cases is 38,366 hours of support team work. A significant portion of this work is likely unnecessary rework that burdens support teams and reduces productivity. Likewise, customers waited an additional 93,201 hours for cases to complete.
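Both comparisons above can be sketched in a few lines. The functions below are illustrative reconstructions under assumed input shapes (per-product mean durations, and per-incident hours by status), not the authors' actual pipeline:

```python
from collections import defaultdict


def trimmed_fold_change(product_means):
    """product_means: {product: (mean_hours_without_pp, mean_hours_with_pp)}.
    Mean fold-change in case duration when cases exhibit ping pong, after
    dropping the top and bottom deciles of products (as in the 2.3-fold
    figure above)."""
    folds = sorted(with_pp / without_pp
                   for without_pp, with_pp in product_means.values())
    k = len(folds) // 10                       # decile size (rounded down)
    trimmed = folds[k:len(folds) - k] if k else folds
    return sum(trimmed) / len(trimmed)


def status_time_delta(cases):
    """cases: list of (has_ping_pong, {status: hours}) per incident.
    For each status, the difference between the mean time spent by ping
    pong cases and by non-ping-pong cases, i.e. the 'source of change'
    decomposition behind Figure 9."""
    totals = {True: defaultdict(float), False: defaultdict(float)}
    counts = {True: 0, False: 0}
    for has_pp, status_hours in cases:
        counts[has_pp] += 1
        for status, hours in status_hours.items():
            totals[has_pp][status] += hours
    statuses = set(totals[True]) | set(totals[False])
    return {s: totals[True][s] / counts[True] - totals[False][s] / counts[False]
            for s in statuses}
```

Applied to the full incident set, the second function would yield the per-status additional hours (e.g. the 40.3 In Progress hours) reported above.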
This likely impacts service level agreements (SLAs), and we would be able to assess this impact more fully with descriptions of the SLAs.

Ping Pong Activity by Support Team

We computed both the absolute number of ping pongs and the frequency of ping pongs attributable to support teams, Org Lines, and functional divisions. The ping pong frequency was defined as the ratio of ping pong events to the total number of transfers between STs.

ST                              # of Transfers   # of Ping Pongs   Ping Pong Frequency   Function   Org Line
Service Line 1 (total = 231)        11,113            2,107               0.19
D4                                     546              336               0.62             A2_1       A2
G97                                  1,233              236               0.19             V3_2       C
D5                                     408              210               0.51             A2_1       C
D8                                     402              170               0.42             A2_1       A2
D2                                     346              162               0.47             A2_1       C
D7                                     205              106               0.52             A2_1       A2
D1                                     181              103               0.57             A2_1       B
G96                                  1,686               69               0.04             V3_2       C
D6                                     124               61               0.49             A2_1       C
G92                                    226               58               0.26             E_5        C
S49                                    148               58               0.39             V3_2       C
Line 1 Subtotal (5% of total)    5,505 (50%)      1,569 (75%)             0.29
Service Line 2 (total = 306)         4,500              974               0.22
V37                                    235              159               0.68             -          V7n
N18                                     50               32               0.64             A2_5       A2
N14                                     67               31               0.46             A2_1       A2
Line 2 Subtotal (1% of total)      352 (8%)         222 (23%)             0.63
Service Line 3 (total = 107)           751               92               0.12
G42                                     45               14               0.31             A2_1       A2
G107                                    10                5               0.50             A2_4       A2
Line 3 Subtotal (2% of total)       55 (7%)          19 (21%)             0.35
Grand Total                         16,384            3,173               0.19

Table 9: A small number of support teams in each service line are responsible for the majority of ping pong events and often also have a high frequency of ping pong. Shown are the top 5% of support teams by total ping pongs for Service Line 1, the three support teams from the top 5% of Service Line 2 (by total ping pongs) that have the highest ping pong frequency, and the top two support teams by total ping pongs from Service Line 3. Excluded: Service Line 2.5.

An examination of total ping pongs by support team identified a strong concentration of ping pongs (Table 9). 73% of ping pongs are attributable to 5% of STs. Segmenting further, the top 1% of STs (D4, G97, D5, D8, D2, V72 2nd, and D7) are responsible for 43% of the total ping pong events (1,379 ping pongs, 21% of all transfers).
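As a sketch of how the per-team figures in Table 9 can be derived, the routine below attributes each transfer to its destination team and flags arrivals at previously involved teams as ping pongs. The attribution rule (crediting the destination team) is our assumption; the paper does not state how ping pongs were assigned to teams:

```python
from collections import defaultdict


def team_ping_pong_stats(cases):
    """cases: iterable of ordered support-team sequences, one per case.
    Returns {team: {"transfers": n, "ping_pongs": m, "frequency": m/n}},
    where frequency is the ratio of ping pong events to transfers,
    as defined in the text."""
    transfers = defaultdict(int)
    pings = defaultdict(int)
    for seq in cases:
        # Collapse consecutive events by the same team into one visit.
        collapsed = []
        for team in seq:
            if not collapsed or collapsed[-1] != team:
                collapsed.append(team)
        seen = set()
        for i, team in enumerate(collapsed):
            if i > 0:                 # every hand-off is one transfer
                transfers[team] += 1
            if team in seen:          # a return visit is a ping pong
                pings[team] += 1
            seen.add(team)
    return {t: {"transfers": n, "ping_pongs": pings[t],
                "frequency": pings[t] / n}
            for t, n in transfers.items()}
```

Teams that only ever originate cases (and never receive a transfer) do not appear in the result, since their transfer count is zero.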
These support teams generally also have high ping pong frequencies (Table 9). Six of the seven STs in the top 1% belong to Service Line 1, which is responsible for 68% of all ping pongs. However, the trend of a few support teams accounting for the majority of ping pong events holds true in each service line. This demonstrates that a small number of support teams are responsible for a majority of the ping pong events across several dimensions. These teams are also strongly associated with functional division A2_1 and Org Line A2. Future analysis should focus on the roles, functions, organization, and connections between these groups; determine the root cause of ping pongs; and optimize incident management processes.

Ping Pong Activity by Product

We also examined ping pongs by product to determine which products are most affected by ping pong. Of the 701 named products, 255 had at least one incident with a ping pong event; we segmented these by ping pong frequency into deciles. Since we observed that the ping pong frequencies of the top decile are dramatically higher (1.6-fold from the 9th to the 10th), we focused our analysis on the constituent 21 products (in order of frequency from 0.80-0.57, Products 510, 799, 736, 141, 775, 303, 158, 727, 137, 157, 398, 97, 542, 789, 776, 159, and 558). Short-term change efforts should be focused on the teams that contribute the most ping pongs to this top decile of products. We compared the ping pong frequency of STs for a product with the product's average ping pong frequency. A subset of STs repeatedly exhibited above-average frequencies: D4, D5, D2, D7, D6, D8, D1, G97 and G57 2nd all exhibited ping pong frequencies above the product average for at least two products (10, 8, 6, 5, 3, 3, 2, 2, and 2 products, respectively). This is perhaps not surprising given that many teams work on the same products.
This analysis arrives at a conclusion similar to the ST-level analysis of ping pongs, identifying the same teams as most responsible for ping pong behavior. The key next step would be to engage these teams to understand their contribution to incident resolution such that these ping pongs can be categorized into essential and nonessential (and therefore noncompliant). Furthermore, additional information about the products–especially product relationships and hierarchies–would also be beneficial in determining whether specific products or product groups have qualities that predispose them to ping pong behavior.

Discussion of Ping Pong Analysis

Through our analysis we have identified that a small number of support teams disproportionately contribute to the total number of ping pong events and the 38,366 hours of In Progress work time. From this information we believe that Volvo IT could conservatively reduce total support team work time by 10,600 hours (28%). We propose a targeted initiative focusing on the eight support teams from Service Line 1, the three teams from Service Line 2, and the two support teams from Service Line 3 listed in Table 9, which are most responsible for ping pongs. Together these teams have a ping pong frequency of 0.65. By reducing this rate by one third to 0.4 (still double the average frequency across all support teams), Volvo IT would eliminate 700 ping pongs a month (given that this data set is representative of a typical month), which, with an average of 2.66 ping pongs per incident with ping pongs, equates to roughly 263 incidents' worth of ping pongs. This reduction would save 40 hours of In Progress time per incident, for a total of 10,600 hours for the month. Furthermore, this analysis does not address the top decile of products, shown in Figure 8, where other initiatives could yield substantial additional savings.
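The savings estimate above is simple arithmetic, which can be made explicit. The sketch below reproduces the rough monthly figure using the 40.3-hour In Progress delta reported earlier; all inputs are taken from the text:

```python
def estimated_ping_pong_savings(eliminated_pings, pings_per_incident,
                                hours_per_incident):
    """Convert a count of eliminated ping pongs into saved In Progress
    hours: eliminated ping pongs -> equivalent incidents -> hours."""
    incidents = eliminated_pings / pings_per_incident
    return incidents * hours_per_incident


# 700 eliminated ping pongs a month, 2.66 ping pongs per affected
# incident, 40.3 additional In Progress hours per affected incident
# (figures from the analysis above).
monthly_hours_saved = estimated_ping_pong_savings(700, 2.66, 40.3)
```

Rounded to the nearest hundred, this gives the c. 10,600 hours per month cited in the discussion.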
A properly designed initiative would determine the root cause of ping pongs through further analysis of the roles, functions, organization, and connections between the support teams and org lines. Additionally, as certain org lines and functions are disproportionately represented among the support teams identified in our analysis, such an initiative must be aimed at the appropriate organizational level in order to effect its intended outcome.

4.4 Wait-User Analysis

Our Understanding of the Wait-User Issue

According to the BPIC 2013 documentation, Wait-User is a sub-status (under the Accepted category) used by Action Owners to manually "stop the clock" on a particular case in order to decrease the total turnaround time for completion of a task. While there are certainly some legitimate uses for this sub-status (such as waiting for information or action from a user), some owners are suspected of blatantly misusing it as a means of improving their own performance metrics in the incident and problem management systems. Cases that include the use of the Wait-User status have a 7-fold longer case duration, with 20% of the additional case time due to Wait-User (Figure 10).

Figure 10: Wait-User cases are 7-fold longer in average duration, with 17% of the increase due to Wait-User time. Source of change analysis between average incidents without Wait-User and average incidents with Wait-User. Average times are in days; the additional times are the difference between the average times of Wait-User and non-Wait-User cases.

In order to understand the use of Wait-User time by individual action owners, we investigated its usage across the various STs, Org Lines, Functions, and owners by country. For this analysis we subset the data and examined only cases that included the Wait-User sub-status. We calculated the average usage of the Wait-User sub-status per case as well as the total time that a case spent in this sub-status.
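The per-case subsetting described above can be sketched as follows. The record shape and the exact sub-status label are assumptions, since the VINST schema is only partially described in the challenge documentation:

```python
from collections import defaultdict


def wait_user_stats(events, wait_label="Wait - User"):
    """events: iterable of (case_id, sub_status, hours) records -- a
    simplified stand-in for the VINST log schema (field layout assumed).
    Returns {case_id: (wait_user_steps, wait_user_hours, total_hours)},
    from which per-case averages and the Wait-User share of total
    duration can be derived."""
    stats = defaultdict(lambda: [0, 0.0, 0.0])
    for case_id, sub_status, hours in events:
        entry = stats[case_id]
        entry[2] += hours                      # total case time
        if sub_status == wait_label:
            entry[0] += 1                      # number of Wait-User steps
            entry[1] += hours                  # time spent in Wait-User
    return {case: tuple(values) for case, values in stats.items()}
```

Filtering the result to cases with at least one Wait-User step reproduces the subset used in the analyses that follow.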
Wait-User Distribution by Support Team

To evaluate ST performance with regard to the use of Wait-User, we analyzed the average frequency of Wait-User usage and the duration of Wait-User by team. Teams that consistently use Wait-User, and do so for longer periods of time across many products, have a disproportionately high impact on overall case durations (Figure 11).

Figure 11: Average Wait-User frequency and duration for STs that work on the most products. Wait-User duration is plotted on a log scale, and the size of the circle is proportionate to the number of products on which the ST has worked.

From this analysis we observe that STs in Service Line 1 use Wait-User frequently and for longer durations than their peers. While it is difficult to conclude that these instances are definitely related to abuses of the Wait-User functionality, any meaningful examination of this phenomenon would surely begin with an investigation of these five support teams. We conducted an analysis of STs by product to identify teams that disproportionately impact average Wait-User time at the product level. We identified 8 teams that have an average Wait-User time more than 7 days above the average for that product on multiple products (Table 10).

ST     Count of Products Where ST's Mean Wait-User Duration Exceeds Overall Average by ≥7 days (Percentage of all such cases)
N49    3 (5.08%)
S24    3 (5.08%)
G96    2 (3.4%)
G97    2 (3.4%)
M10    2 (3.4%)
N25    2 (3.4%)
S41    2 (3.4%)
S55    2 (3.4%)

Table 10: STs underperforming other teams' Wait-User averages by ≥7 days.

The combination of the above analyses demonstrates that two STs, G96 and G97 (highlighted in Figure 11 and Table 10), not only have the highest number of deviations from the average Wait-User time when all products are considered, but also exhibit Wait-User durations that exceed their peers' by more than a week.
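The Table 10 criterion can be expressed as a short screen over per-product, per-team means. The averaging rule used here (an unweighted mean across a product's teams) is our assumption; the paper does not specify how the overall product average was computed:

```python
from collections import Counter


def wait_user_outliers(product_team_means, threshold_days=7.0):
    """product_team_means: {product: {team: mean_wait_user_days}}.
    Counts, per team, the products on which that team's mean Wait-User
    duration exceeds the product-wide average by at least threshold_days
    (the Table 10 criterion)."""
    counts = Counter()
    for team_means in product_team_means.values():
        product_avg = sum(team_means.values()) / len(team_means)
        for team, mean_days in team_means.items():
            if mean_days - product_avg >= threshold_days:
                counts[team] += 1
    return dict(counts)
```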
However, in order to draw conclusions regarding legitimate use of Wait-User time, we would require more specific information about the process requirements, ST capabilities, and the formal guidelines regarding proper usage of the Wait-User sub-status.

Wait-User Distribution by Org Line

Analysis of Wait-User activity for the different Org Lines exhibited more consistency in the frequency of Wait-User usage and in the mean duration of total Wait-User time per case. The Org Lines with the highest frequency of Wait-User usage had a lower mean duration of Wait-User usage, although the variance of the distribution of these frequencies was low. Org Lines C and A2 have the highest usage of Wait-User time but a lower mean duration compared to V2 and B, both of which have fewer occurrences. Org Line C has a higher overall frequency of Wait-User usage, whereas Org Line A2 has a longer average duration per instance of Wait-User. This may be due to the initial assignment of more complex, elevated start cases to Org Line A2, as described in our push-to-front analysis.

Wait-User Distribution by Support Team Function / Division

We analyzed Wait-User activity by Function to assess usage at this organizational level. These groups also exhibited consistency in the frequency of Wait-User usage and in the mean duration of total Wait-User time per case, similar to what we observed in our analysis of Org Lines. Only A2_3 had a substantially longer average Wait-User time per case. Interestingly, the three functional divisions with the highest frequency of Wait-User usage are generally the initial groups assigned to routine cases exhibiting push-to-front type behavior.

Wait-User Distribution by Location

To understand the usage of Wait-User by action owners by location, we examined the frequency of Wait-User usage and the mean duration of total Wait-User time per case.
As with our analyses of Org Lines and functions, it appears that the owners who use the Wait-User status most often are not the ones who use it for the longest duration. While high usage is consistent across Sweden, the owners using the option the most in Sweden do not have the highest average Wait-User time per case.

Discussion of Wait-User Analysis

The use of the Wait-User option correlates with significantly longer case durations; however, the option stops the clock as far as turnaround time for a case is concerned, leaving it open to abuse at the user level. Our analysis of Wait-User activity demonstrated clear outliers at the ST, Org Line, and Owner-Country levels. STs G97, G96, G230 2nd, D7 and D8 use the Wait-User option the most across the most products, and use it for durations longer than other STs working on the same products. Org Lines that use the option the most exhibit lower durations of Wait-User usage. Org Line C experiences higher usage than Org Line A2, which may be indicative of the initial assignment of more complex elevated start cases to A2.

5 Conclusions

Our analysis has identified several performance improvement opportunities in the IT incident management process for Volvo, Belgium. Such improvements would increase resource efficiency and decrease customer impact through the reduction of case resolution times. We believe the potential performance gains are substantial and warrant further investigation and analysis to develop specific action plans to realize such gains. We discovered the standard process maps for both incidents and problems across Org Line A2, Org Line C, and the other Org Lines. From these maps, we identified 1,800 Help Desk calls belonging exclusively to Org Line C, and separated these from tickets not Completed–In Call.
Excluding calls, we found similar processes performed in Org Lines C and A2, which indicates some degree of standardization between these Org Lines. This suggests the presence of a business reference model–likely captured in a common language (e.g. business process model and notation)–that provides a focal point for the process modifications this study recommends. In the absence of a reference model, the processes we discovered could be easily translated into formats readily leveraged by internal process owners. We were able to identify process inefficiencies related to push-to-front, Wait-User abuse, and ping pong. In general, substantial opportunity for improvement is concentrated in a small number of support teams. The identified noncompliance of this small group can be further investigated and gains realized in the near term. We observed a strong correlation between push-to-front activity and incident frequency for products–that is, products with more incidents are less likely to be escalated. This tendency to keep higher-frequency products at Service Line 1 indicates a potential to better leverage knowledge management practices for lower-frequency products. A learning curve appears to exist for product-specific solutions to be delivered by dedicated first line resources–the more incidents related to a product, the more likely the product is to remain at the 1st line. We hypothesize that there is a knowledge sharing mechanism (formal or informal) in place for the most commonly occurring incidents. VINST documentation indicates that recording solutions to incidents and problems is an optional step [5] in the process. Creating incentives to capture such solutions formally is likely to help improve push-to-front behavior and reduce ping pong behavior. The prevalence of ping pong behavior appears to cause significantly decreased resource efficiency and increased resolution times.
We recommend further investigation into specific STs to segment ping pongs required for incident or problem resolution from those that are not. Given the large size and concentration of this opportunity, there is a clear opportunity to reduce ping pongs by identifying and addressing team-specific root causes. Our analysis of Wait-User activity demonstrated clear outliers at the ST, Org Line, and Owner-Country levels. Among STs, G97, G96, G230 2nd, D7 and D8 stood out as the teams using the Wait-User option the most: they selected Wait-User for the most products and for the longest durations among STs working on the same products. While outlier behavior is apparent from our investigations, any inquiry into the legitimate or illegitimate use of Wait-User behavior requires a better understanding of the roles of the different STs, Org Lines, Owners, and product lines. We identified 255 cases (c. 3%) with total duration over 50 days, which we generally excluded from analyses involving total duration. These cases could represent data quality or logging issues, because the majority of their time appears to be spent in Queued–Awaiting Assignment (i.e. resources resolved a case but failed to record the resolution). However, if engagement with the involved Action Owners revealed that these cases reflect ongoing issues, understanding their details would provide substantial value for improving SLA performance (and accuracy) and reducing customer impacts. Overall, given the requirement of making the data publicly available, Volvo shared very limited data about the incidents and problems. We understand the limitations placed on sharing any additional data given the risks of disclosing confidential information.
However, if similar analyses were carried out with access to relevant details about incidents, products, problems, resources, organization structure, and performance expectations, one could build highly actionable process and operations change recommendations to drive meaningful performance improvement. Our conviction in the power of process mining of event logs, in combination with details about the work items and the overall operational setup, to yield powerful insights has grown further with our participation in this year's BPIC challenge. We thank Volvo for making the data available and the organizers of this competition for allowing us to participate.

Acknowledgements: We thank Lalit Wangikar, Nick Hartman and Eric Chung for their discussion and helpful suggestions.

Contributions: AB performed push-to-front analysis, SA performed Wait-User analysis, TM performed ping pong analysis, and AA performed process conformity analysis. AB, SA, AA, TM and MLR interpreted data and composed the manuscript.

References

1. Ferreira, D., Mira da Silva, M.: Using process mining for ITIL assessment: a case study with incident management. In: Proceedings of the 13th Annual UKAIS Conference, Bournemouth University
2. De Weerdt, J., vanden Broucke, S.K.L.M., Vanthienen, J., Baesens, B.: Leveraging process discovery with trace clustering and text mining for intelligent analysis of incident management processes. In: IEEE Congress on Evolutionary Computation 2012, pp. 1-8
3. VINST data set and description. Jun 14, 2012. http://www.win.tue.nl/bpi2013/lib/exe/fetch.php?media=vinst_data_set.pdf. Accessed: Jul 10, 2013
4. "BPIC Challenge 2013." Flux Capacitor. Fluxicon, Jun 1, 2013. Accessed: Jul 11, 2013
5. VINST User Manual. Jun 14, 2013. http://www.win.tue.nl/bpi2013/lib/exe/fetch.php?media=vinst_manual.pdf. Accessed: Jul 11, 2013