Judith Michael, Victoria Torres (eds.): ER Forum, Demo and Posters 2020 163

Identifying Cohorts that Differ in their Behaviour: Tool Support

Sander J. J. Leemans¹, Shiva Shabaninejad², Kanika Goel¹, Hassan Khosravi², Shazia Sadiq², and Moe T. Wynn¹

¹ Queensland University of Technology, Brisbane, Australia
² University of Queensland, Brisbane, Australia
{s.leemans,k.goel,m.wynn}@qut.edu.au, {s.shabaninejad, h.khosravi}@uq.edu.au, shazia@itee.uq.edu.au

Abstract. Process mining is a specialised form of data analytics that aims to provide data-driven improvement recommendations, derived from event logs. These event logs contain information about the execution of real-world processes, which may be complex. Cohort identification recommends drill-down filters for process mining, based on differences in process. In this paper, we describe its integration in three process mining tools: as a stand-alone ProM plug-in, as part of the visual Miner and (planned) as part of Course Insights.

Keywords: Process Mining · Feature Selection · Filter Recommendation · Stochastic Comparative Process Mining

1 Introduction

Process mining, a specialised form of data analytics, provides techniques with which analysts can extract insights from process behaviour recorded in event logs. The insights are used to provide data-driven recommendations to improve business operations. Many real-life processes are complex in nature, and studying their process models is challenging [1]. Process mining techniques to deal with this complexity include filtering, slicing and dicing, and process cubes.

Cohort identification aims to identify and recommend sub-sets of the traces in the log (cohorts). These cohorts are defined by attributes of traces (e.g.
“claim amount”, “gender” or “country”). Cohort identification recommends attributes and values such that the traces that have the attribute and value (the cohort) differ as much as possible from the traces that lack the attribute or have a different value (the anti-cohort) in terms of the process being followed. This covers the order of steps taken in traces and how often different sequences appear in the (anti-)cohort. The difference between the cohort and the anti-cohort is expressed as a distance measure [4], where 1 means that their processes are completely different (i.e. no activity appears in both), and 0 means that their processes are no more different than a random division of the combined log.

Copyright © 2020 for this paper by its authors. Use permitted under Creative Commons License Attribution 4.0 International (CC BY 4.0).

That is, cohort identification finds groups of cases in event logs that follow a process that differs from the process of other groups. For instance, consider the following log consisting of 1000 cases of customers purchasing online and registering for an account, in which each trace is annotated with whether the customer was a Silver (S) or Gold (G) customer, from an East (E) or West (W) branch:

$[\langle \text{register}, \text{purchase} \rangle^{200}_{SE}, \langle \text{register}, \text{purchase} \rangle^{100}_{SW}, \langle \text{register}, \text{purchase} \rangle^{50}_{GE}, \langle \text{register}, \text{purchase} \rangle^{100}_{GW},$
$\langle \text{purchase}, \text{register} \rangle^{100}_{SE}, \langle \text{purchase}, \text{register} \rangle^{50}_{SW}, \langle \text{purchase}, \text{register} \rangle^{200}_{GE}, \langle \text{purchase}, \text{register} \rangle^{100}_{GW}]$.

In this log, Gold customers executed the trace variant $\langle \text{register}, \text{purchase} \rangle$ in $150/450 = 1/3$ of their cases, while for the other customers this is $450/1000 = 0.45$. Thus, the likelihood that Gold customers first register is lower than for other customers. Cohort identification would assess this for all potential combinations of attributes and values, and provide a ranked list based on a quantification of such differences.
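The comparison in this example can be sketched in a few lines of Python. The sketch assumes the multiplicities read 200, 100, 50 and 100 for ⟨register, purchase⟩ and 100, 50, 200 and 100 for ⟨purchase, register⟩ (for SE, SW, GE and GW respectively). It ranks cohorts by the plain gap in the share of one trace variant between cohort and anti-cohort; that gap is a hypothetical stand-in chosen for brevity, not the distance measure of [4]. The names LOG, share and rank_cohorts are likewise illustrative and are not part of any of the tools described here.

```python
from fractions import Fraction

# Multiplicities as read from the example log: (tier, branch, variant) -> count.
# S/G = Silver/Gold, E/W = East/West. This reading of the counts is an assumption.
RP = ("register", "purchase")
PR = ("purchase", "register")
LOG = {
    ("S", "E", RP): 200, ("S", "W", RP): 100,
    ("G", "E", RP): 50, ("G", "W", RP): 100,
    ("S", "E", PR): 100, ("S", "W", PR): 50,
    ("G", "E", PR): 200, ("G", "W", PR): 100,
}
ATTRIBUTES = {"tier": 0, "branch": 1}  # position of each trace attribute in the key


def share(log, variant, keep):
    """Fraction of the traces selected by `keep` that follow `variant`."""
    total = sum(n for key, n in log.items() if keep(key))
    hits = sum(n for key, n in log.items() if keep(key) and key[2] == variant)
    return Fraction(hits, total)


def rank_cohorts(log, variant):
    """Rank every attribute=value cohort by |share(cohort) - share(anti-cohort)|.

    This simple gap is for illustration only; it is not the stochastic
    distance measure used by the actual technique.
    """
    ranked = []
    for name, pos in ATTRIBUTES.items():
        for value in sorted({key[pos] for key in log}):
            cohort = share(log, variant, lambda k: k[pos] == value)
            anti = share(log, variant, lambda k: k[pos] != value)
            ranked.append((abs(cohort - anti), name, value, cohort, anti))
    # Largest gap first, mirroring the ranked list of recommendations.
    return sorted(ranked, reverse=True)


if __name__ == "__main__":
    for gap, name, value, cohort, anti in rank_cohorts(LOG, RP):
        print(f"{name}={value}: cohort {cohort}, anti-cohort {anti}, gap {gap}")
```

Under these assumptions, Gold customers follow ⟨register, purchase⟩ in 1/3 of their cases, and the customer tier separates the log more strongly than the branch does.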
Cohort identification has similar goals to other process comparison techniques such as trace clustering [10], concept drift detection [7], and event attribute clustering [2], but provides more explainable results: the output is a list of attribute-value pairs that denote sub-logs of interest. The details of our cohort identification technique are described in [6].

In this paper, we describe the integration of cohort identification into three existing open-source data intelligence tools: as a plug-in of the ProM framework (Section 2), as an extension of the process mining tool visual Miner (Section 3), and as an extension of the learning analytics dashboard Course Insights (Section 4). The first tool demonstrates the use of cohort identification as a stand-alone technique, the second tool depicts the use of cohort identification in conjunction with other process mining techniques, and the third tool illustrates the embedding of cohort identification in a learning analytics context. A screencast is available online³.

Fig. 1: Cohort Identification in ProM.

2 Stand-alone Plug-in of ProM

The ProM framework [3] is a state-of-the-art open-source process mining framework aimed at practitioners and academics. Cohort identification has been implemented as a plug-in of ProM, enabling practitioners and academics to access the technique. When the plug-in is started with an event log as input, two parameters can be set: the attribute that determines the activity being executed, and the maximum number of trace attributes that is exhaustively considered. Upon completion, first a diversified set of cohorts is shown (with attribute, value, cohort size, and distance between the cohort and anti-cohort). Figure 1 shows a screenshot.

Fig. 2: The visual Miner (left) and the integration of cohort identification (right).

³ https://vimeo.com/442323972

Implementation.
The implementation is flexible, as it provides extension points to (1) elicit attribute value ranges, (2) measure distance between events, traces and logs, and (3) truncate cohorts based on size or other aspects. Furthermore, the implementation is multithreaded and, for pruning, stores a Map entry, an int[] and an AtomicBoolean for each combination of attribute value ranges.

Maturity & How to Access. Cohort identification is open source and is part of the ProM 6.10 release; see⁴. The plug-in has been successfully applied in three case studies [6].

3 visual Miner

The visual Miner (vM) [5] is an existing process mining tool that enables end-users to combine advanced academic process mining techniques in an industry-capable and user-friendly package. The input of vM is an event log. First, vM applies a process discovery technique to the log to obtain a process model. Second, vM applies a conformance checking technique and visualises the differences between the log and the discovered model. Third, it computes detailed frequency and performance information, and visualises this on the model, among other means through animation. Fourth, it allows the user to drill down and focus on parts of the event log that are of interest, by applying one of several filters. Settings of any of these techniques and filters can be changed at any time, and vM will update and redo the necessary steps automatically [5]. A screenshot is shown in Figure 2; for a complete overview of vM’s features, please refer to⁵.

Cohort Identification (new). While vM makes it easy to drill down into parts of the log or process of particular interest, it was previously up to the user to manually analyse the available visualisations to discover potentially interesting parts.

⁴ http://promtools.org
⁵ http://leemans.ch/publications/ivmProM6.10.pdf
[Figure: recommendation dialog. Selected attributes: Program, Residential Status; minimum coverage: 20%; recommendations: 2. Recommended filters: Program = Tourism (coverage 45%, distance 50.06%); Program = Engineering and Residential Status = Domestic (coverage 45%, distance 25.80%).]

Fig. 3: Cohort identification in Course Insights.

Cohort identification suggests filters on trace attributes in an event log, such that applying the filter leads to the largest differences in process. Figure 2 shows a screenshot of the integration of cohort identification in vM: the cohorts are computed automatically in the background, and the result is shown to the user. The first column shows the trace attribute of the cohort, the second column the values of the attribute that are in the cohort, the third column the number of traces in the cohort, and the last column the distance between the cohort and the anti-cohort. Using a click (for the cohort) or shift+click (for the anti-cohort), one can quickly filter the event log down to the corresponding traces, in order to study the differences in process in more detail. The identified cohorts can be exported to an Excel document for further analysis. Embedding cohort identification in vM makes it easy for end-users to conduct detailed process mining analysis using one plug-in.

Maturity & How to Access. vM has been applied in many process mining projects by industry partners and academics (see⁶ for an overview). Cohort identification was added in April 2020 and has been successfully applied in three case studies [6]. The visual Miner is open source, is part of the ProM framework [3], and can be downloaded from⁶.

4 Course Insights

Course Insights is an instructor-facing learning analytics dashboard, developed at the University of Queensland, that empowers course coordinators to gain insights and act on student data to enhance student learning and experience across the course life-cycle at scale.
It collates student data from a variety of learning systems and sources and displays it to instructors in a single, easy-to-use interface. An essential element is its comparative analysis functionality, which enables course coordinators to use filters to compare and contrast different student groups based on their demographics, enrolment, engagement and performance data. An observational study that analysed how the filters were used by 71 staff members found that commonly only a small subset of the features was used and that filters were rarely applied on top of one another [8]. To overcome this challenge, we are implementing cohort identification in Course Insights to recommend insightful filters to instructors [9].

⁶ http://visualminer.org

Cohort Identification (planned). Figure 3 illustrates the proposed presentation of filter recommendations to instructors, including the attributes and values, coverage (fraction of students covered by the filter), and distance (insightfulness of the filter).

Maturity & How to Access. A case study based on data from a course with 875 students and high demographic and educational diversity has explored the potential benefits of applying cohort identification to Course Insights [9]. Cohort identification is planned to be implemented in Course Insights, which can be accessed via⁷. A remaining challenge is to make Course Insights fully process aware: with cohort identification, users can drill down into sub-groups with differences in their process; however, more support to study these differences is necessary.

5 Conclusion

In this paper, we described how cohort identification, which recommends trace-attribute-based filters to maximise the differences between processes, is implemented as a stand-alone ProM plug-in, is integrated in the visual Miner, and is being integrated in Course Insights.
The technique filters sub-logs of traces (cohorts) defined by trace attribute value ranges (features) to compare behavioural differences between cohorts or to drill down into a particular cohort. The technique can be used to understand the differences between two cohorts and to answer questions related to a particular cohort. Cohort identification can be applied in reasonable time to event logs. In the future, we intend to focus on implementing process infrastructure in Course Insights and on automated comparison techniques to compare the identified cohorts.

References

1. van der Aalst, W.M.P.: Process Mining - Data Science in Action (2016)
2. Bolt, A., van der Aalst, W.M.P., de Leoni, M.: Finding process variants in event logs (short paper). In: CoopIS. vol. 10573, pp. 45–52 (2017)
3. van Dongen, B.F., et al.: The ProM framework: A new era in process mining tool support. In: Petri Nets. pp. 444–454 (2005)
4. Leemans, S.J.J., Syring, A.F., van der Aalst, W.M.P.: Earth movers’ stochastic conformance checking. In: BPM forum. pp. 127–143 (2019)
5. Leemans, S.J.J., et al.: Process and deviation exploration with Inductive visual Miner. In: BPM demos. vol. 1295, p. 46. CEUR-WS.org (2014)
6. Leemans, S.J.J., et al.: Identifying cohorts: Recommending drill-downs based on differences in behaviour for process mining. In: ER (2020)
7. Maaradji, A., Dumas, M., Rosa, M.L., Ostovar, A.: Detecting sudden and gradual drifts in business processes from execution traces. TKDE 29(10), 2140–2154 (2017)
8. Shabaninejad, S., et al.: Automated insightful drill-down recommendations for learning analytics dashboards. In: LAK. pp. 41–46 (2020)
9. Shabaninejad, S., et al.: Recommending insightful drill-downs based on learning processes for learning analytics dashboards. In: AIED. pp. 486–499 (2020)
10. Weerdt, J.D., vanden Broucke, S.K.L.M., Vanthienen, J., Baesens, B.: Active trace clustering for improved process discovery. TKDE 25(12), 2708–2720 (2013)

⁷ https://analytics.itali.uq.edu.au/dev/insights